Comments (7)

AddyLaddy commented on July 28, 2024

Does your security group include a rule allowing All Traffic to and from itself, both Inbound and Outbound?

from aws-ofi-nccl.

rashikakheria commented on July 28, 2024

In addition to what David asked, could you provide the following information as well?

  1. Complete log of your run.
  2. EFA installer version. You can find this using
     cat /opt/amazon/efa_installed_packages
  3. Which commit of aws-ofi-nccl are you using to run the tests? Is it the latest commit on the aws branch?

eric-haibin-lin commented on July 28, 2024

Thanks for the quick reply.

Does your security group include a rule for All Traffic for itself Inbound & Outbound?

Yes.

Complete log of your run.

efa.log

EFA installer version. You can find this using

# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64
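
For anyone comparing versions across nodes, here is a minimal sketch that pulls the installer and libfabric versions out of a file in the format shown above. The sample text is copied from this comment; on a real host you would pass /opt/amazon/efa_installed_packages instead. The function names `efa_installer_version` and `libfabric_version` are hypothetical helpers, not part of the EFA installer.

```shell
#!/usr/bin/env bash
# Sample file with the contents pasted above (stand-in for
# /opt/amazon/efa_installed_packages on a real host).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64
EOF

# Print the value after "# EFA installer version:".
efa_installer_version() {
    sed -n 's/^# EFA installer version: *//p' "$1"
}

# Print the version field of the libfabric1 package entry
# (the text between "libfabric1_" and the trailing "_<arch>").
libfabric_version() {
    tr ' ' '\n' < "$1" | sed -n 's/^libfabric1_\(.*\)_[^_]*$/\1/p'
}

echo "installer: $(efa_installer_version "$sample")"
echo "libfabric: $(libfabric_version "$sample")"
```

Running the same two functions on every node in the hostfile would show quickly whether the nodes have matching installs.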

I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complained about some missing .so files.

rashikakheria commented on July 28, 2024

Could you confirm the commit that you are using for aws-ofi-nccl? Please use the latest aws branch of the plugin when running on EC2 infrastructure.

Also, does host "ip-172-32-36-209" have EFA installed?

apeforest commented on July 28, 2024

I got the same error while running nccl-tests:

~/anaconda3/bin/mpirun \
        -x FI_PROVIDER="efa" \
        -x FI_EFA_TX_MIN_CREDITS=64 \
        -x LD_LIBRARY_PATH=$HOME/drivers/aws-ofi-nccl/install/lib/:$HOME/drivers/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
        -x NCCL_DEBUG=WARN -x NCCL_TREE_THRESHOLD=0 --hostfile $HOME/hosts -n 16 -N 8 \
        --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
        $HOME/drivers/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

Running on a single node is fine. The error only occurs when I run on multiple nodes.
I am following the Quip doc "Running nccl-tests on AWS EC2" and got the following error:

[ec2-user@ip-172-31-10-20 nccl-tests]$ ./run-test.sh  | tee test.log
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
#   Rank  0 Pid  23389 on ip-172-31-10-20 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  1 Pid  23390 on ip-172-31-10-20 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank  2 Pid  23391 on ip-172-31-10-20 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank  3 Pid  23392 on ip-172-31-10-20 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank  4 Pid  23393 on ip-172-31-10-20 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank  5 Pid  23394 on ip-172-31-10-20 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank  6 Pid  23395 on ip-172-31-10-20 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank  7 Pid  23396 on ip-172-31-10-20 device  7 [0x00] Tesla V100-SXM2-32GB
#   Rank  8 Pid  10508 on ip-172-31-1-59 device  0 [0x00] Tesla V100-SXM2-32GB
#   Rank  9 Pid  10509 on ip-172-31-1-59 device  1 [0x00] Tesla V100-SXM2-32GB
#   Rank 10 Pid  10510 on ip-172-31-1-59 device  2 [0x00] Tesla V100-SXM2-32GB
#   Rank 11 Pid  10511 on ip-172-31-1-59 device  3 [0x00] Tesla V100-SXM2-32GB
#   Rank 12 Pid  10512 on ip-172-31-1-59 device  4 [0x00] Tesla V100-SXM2-32GB
#   Rank 13 Pid  10513 on ip-172-31-1-59 device  5 [0x00] Tesla V100-SXM2-32GB
#   Rank 14 Pid  10514 on ip-172-31-1-59 device  6 [0x00] Tesla V100-SXM2-32GB
#   Rank 15 Pid  10515 on ip-172-31-1-59 device  7 [0x00] Tesla V100-SXM2-32GB
NCCL version 2.4.6+cuda10.0

ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'

ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[42896,1],8]
  Exit code:    3
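
To see at a glance which hosts are emitting the warning, something like the following sketch could be run against the saved log. The sample lines are copied from the output above; `warn_hosts` is a hypothetical helper name, and on a real run you would pass test.log instead of the temporary file.

```shell
#!/usr/bin/env bash
# Sample log lines copied from the output above (stand-in for test.log).
log="$(mktemp)"
cat > "$log" <<'EOF'
ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
EOF

# Print each host that logged an "NCCL WARN" line, followed by how many
# such lines it produced. The host is the first colon-separated field.
warn_hosts() {
    grep 'NCCL WARN' "$1" | cut -d: -f1 | sort | uniq -c | awk '{print $2, $1}'
}

warn_hosts "$log"
```

In the excerpt pasted here, both warnings come from ip-172-31-1-59, the second node in the run.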

rashikakheria commented on July 28, 2024

This happens if you are using the master branch. Please use the aws branch when working with EFA.
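
A rough sketch for checking which branch a checkout is on. The default path is taken from the LD_LIBRARY_PATH in the mpirun command earlier in this thread and is only an assumption about where the plugin source lives; `current_branch` and `check_plugin_branch` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Assumed location of the plugin checkout; override PLUGIN_DIR if yours differs.
PLUGIN_DIR="${PLUGIN_DIR:-$HOME/drivers/aws-ofi-nccl}"

# Print the branch checked out in the given repository.
# Fails if the path is not a git checkout or HEAD is detached.
current_branch() {
    git -C "$1" symbolic-ref --short HEAD 2>/dev/null
}

# Report whether the checkout is on the aws branch.
check_plugin_branch() {
    local branch
    branch="$(current_branch "$1")" || {
        echo "not a git checkout (or detached HEAD): $1"
        return 2
    }
    if [ "$branch" = "aws" ]; then
        echo "ok: on aws branch"
    else
        echo "on '$branch'; switch with: git -C $1 checkout aws"
        return 1
    fi
}

check_plugin_branch "$PLUGIN_DIR" || true
```

After switching branches, the plugin would need to be rebuilt and reinstalled on every node so that the library found through LD_LIBRARY_PATH comes from the aws branch.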

rashikakheria commented on July 28, 2024

Please re-open if you see the issue again.
