Comments (7)
Does your security group include a self-referencing rule that allows all traffic, both inbound and outbound?
from aws-ofi-nccl.
In addition to what David asked, could you provide the following information as well?
- Complete log of your run.
- EFA installer version. You can find this by running: cat /opt/amazon/efa_installed_packages
- Which commit of aws-ofi-nccl are you using to run the tests? Is it the latest commit on the aws branch?
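The items requested above could be gathered in one pass with something like the following. This is only a sketch: the `collect_diagnostics` helper and the `PLUGIN_DIR` default are assumptions for illustration (the default mirrors the `$HOME/drivers/aws-ofi-nccl` path that appears later in this thread), so adjust both to your setup.

```shell
#!/usr/bin/env bash
# Sketch: print the EFA installer details and the aws-ofi-nccl branch/commit.
# PLUGIN_DIR is an assumed checkout location; change it to match your system.
EFA_PACKAGES_FILE="${EFA_PACKAGES_FILE:-/opt/amazon/efa_installed_packages}"
PLUGIN_DIR="${PLUGIN_DIR:-$HOME/drivers/aws-ofi-nccl}"

# collect_diagnostics <efa_packages_file> <plugin_checkout_dir>
collect_diagnostics() {
    echo "== EFA installer =="
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "not found: $1"
    fi

    echo "== aws-ofi-nccl checkout =="
    if [ -d "$2/.git" ]; then
        # Branch name first, then the exact commit that is checked out.
        git -C "$2" rev-parse --abbrev-ref HEAD
        git -C "$2" rev-parse HEAD
    else
        echo "not a git checkout: $2"
    fi
}

collect_diagnostics "$EFA_PACKAGES_FILE" "$PLUGIN_DIR"
```

Redirecting the output to a file (for example `./collect.sh > diagnostics.txt`, where the script name is hypothetical) makes it easy to attach alongside the complete run log.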
from aws-ofi-nccl.
Thanks for the quick reply.
> Does your security group include a rule for All Traffic for itself Inbound & Outbound ?
Yes.
> Complete log of your run.
> EFA installer version. You can find this using
# EFA installer version: 1.4.1
# Debug packages installed: no
# Packages installed:
efa_1.3.0-1.amzn1_amd64 libfabric1_1.8.0amzn1.0_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 openmpi_3.1.4-2_amd64
I was using Horovod with my model training code. I tried to compile nccl-tests, but the linker complains about some missing .so files.
from aws-ofi-nccl.
Could you confirm which commit of aws-ofi-nccl you are using? Please use the latest aws branch of the plugin when running on EC2 infrastructure.
Also, does host "ip-172-32-36-209" have EFA installed?
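One way to check a host for EFA is to ask libfabric's `fi_info` utility whether it can see the `efa` provider. The sketch below assumes `fi_info` is installed, and that the EFA installer placed it at `/opt/amazon/efa/bin/fi_info`; that path is an assumption, so pass a different one if yours differs. The `has_efa_provider` helper name is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed (exit status 0) only if fi_info reports an EFA provider.
# has_efa_provider [path_to_fi_info]
has_efa_provider() {
    fi_info_bin="${1:-fi_info}"
    if ! command -v "$fi_info_bin" >/dev/null 2>&1; then
        echo "fi_info not found: $fi_info_bin" >&2
        return 2
    fi
    # "-p efa" restricts the listing to the efa provider; with no EFA device
    # nothing matching "provider: efa" is printed and grep fails.
    "$fi_info_bin" -p efa 2>/dev/null | grep -q "provider: efa"
}

if has_efa_provider /opt/amazon/efa/bin/fi_info; then
    echo "EFA provider available on $(uname -n)"
else
    echo "no EFA provider detected on $(uname -n)"
fi
```

Run it on every node listed in the hostfile, since the failure in a multi-node job can come from a single host that lacks the device or the driver.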
from aws-ofi-nccl.
I got the same error while running nccl-tests:
~/anaconda3/bin/mpirun \
-x FI_PROVIDER="efa" \
-x FI_EFA_TX_MIN_CREDITS=64 \
-x LD_LIBRARY_PATH=$HOME/drivers/aws-ofi-nccl/install/lib/:$HOME/drivers/nccl/build/lib:/usr/local/cuda-10.0/lib64:/opt/amazon/efa/lib64:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=WARN -x NCCL_TREE_THRESHOLD=0 --hostfile $HOME/hosts -n 16 -N 8 \
--mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/drivers/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
Running on a single node is fine. The error only occurs when I run on multiple nodes.
I am actually following the quip doc "Running nccl-tests on AWS EC2" and got the following error:
[ec2-user@ip-172-31-10-20 nccl-tests]$ ./run-test.sh | tee test.log
# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 2(factor) warmup iters: 5 iters: 100 validation: 1
#
# Using devices
# Rank 0 Pid 23389 on ip-172-31-10-20 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 1 Pid 23390 on ip-172-31-10-20 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 2 Pid 23391 on ip-172-31-10-20 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 3 Pid 23392 on ip-172-31-10-20 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 4 Pid 23393 on ip-172-31-10-20 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 5 Pid 23394 on ip-172-31-10-20 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 6 Pid 23395 on ip-172-31-10-20 device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 7 Pid 23396 on ip-172-31-10-20 device 7 [0x00] Tesla V100-SXM2-32GB
# Rank 8 Pid 10508 on ip-172-31-1-59 device 0 [0x00] Tesla V100-SXM2-32GB
# Rank 9 Pid 10509 on ip-172-31-1-59 device 1 [0x00] Tesla V100-SXM2-32GB
# Rank 10 Pid 10510 on ip-172-31-1-59 device 2 [0x00] Tesla V100-SXM2-32GB
# Rank 11 Pid 10511 on ip-172-31-1-59 device 3 [0x00] Tesla V100-SXM2-32GB
# Rank 12 Pid 10512 on ip-172-31-1-59 device 4 [0x00] Tesla V100-SXM2-32GB
# Rank 13 Pid 10513 on ip-172-31-1-59 device 5 [0x00] Tesla V100-SXM2-32GB
# Rank 14 Pid 10514 on ip-172-31-1-59 device 6 [0x00] Tesla V100-SXM2-32GB
# Rank 15 Pid 10515 on ip-172-31-1-59 device 7 [0x00] Tesla V100-SXM2-32GB
NCCL version 2.4.6+cuda10.0
ip-172-31-1-59:10508:10581 [0] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
ip-172-31-1-59:10515:10584 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
ip-172-31-1-59: Test NCCL failure common.cu:782 'unhandled system error'
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[42896,1],8]
Exit code: 3
from aws-ofi-nccl.
This would happen if you are using the master branch. Please use the aws branch when working with EFA.
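To confirm which branch a plugin checkout is on before rebuilding, a small check like this may help. It is a sketch only: the `on_branch` helper is hypothetical, and the `PLUGIN_DIR` default is assumed from the install path used in the mpirun command earlier in the thread.

```shell
#!/usr/bin/env bash
# Sketch: verify that the aws-ofi-nccl checkout is on the aws branch.
# PLUGIN_DIR is an assumed location; point it at your own checkout.
PLUGIN_DIR="${PLUGIN_DIR:-$HOME/drivers/aws-ofi-nccl}"

# on_branch <checkout_dir> <branch_name>: succeeds if the names match.
on_branch() {
    [ "$(git -C "$1" rev-parse --abbrev-ref HEAD 2>/dev/null)" = "$2" ]
}

if on_branch "$PLUGIN_DIR" aws; then
    echo "aws-ofi-nccl is on the aws branch"
else
    echo "not on the aws branch; try: git -C $PLUGIN_DIR checkout aws"
    echo "then rebuild and reinstall the plugin before rerunning the test"
fi
```

After switching branches, the plugin library on `LD_LIBRARY_PATH` still comes from the old build until the plugin is rebuilt and reinstalled.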
from aws-ofi-nccl.
Please re-open if you see the issue again.
from aws-ofi-nccl.