Comments (14)
Hi, the bandwidth is indeed too low for p4d. Two things worth looking into:
- Make sure you set the environment variable FI_EFA_USE_DEVICE_RDMA to 1. This means passing -x FI_EFA_USE_DEVICE_RDMA=1 to your mpirun command. You should then see [receive/send] via NET/AWS Libfabric/GDRDMA in the output.
- Make sure you have 4 EFA devices attached to each p4d instance. You can check this by running lspci on your instance.
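As a rough sketch, both checks can be scripted. The function names count_efa_devices and uses_gdrdma are hypothetical helpers, not part of any AWS tool. The first assumes EFA adapters appear in lspci output as an Amazon.com device named efa0 (or, with a newer PCI ID database, as "Elastic Fabric Adapter"); the second only looks for the GDRDMA string quoted above in NCCL debug output.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; both read text on stdin, so they can be fed live
# command output or a saved log file.

# Count PCI devices that look like EFA adapters in `lspci` output.
# Assumption: EFA shows up as "Amazon.com, Inc. Device efa<N>" or as
# "Amazon.com, Inc. Elastic Fabric Adapter ...".
count_efa_devices() {
    # grep -c exits non-zero when the count is 0; keep the function's status 0.
    grep -c -E 'Amazon\.com, Inc\. (Device efa|Elastic Fabric Adapter)' || true
}

# Succeed if the NCCL debug log mentions the GPUDirect RDMA transport string.
uses_gdrdma() {
    grep -q 'NET/AWS Libfabric/GDRDMA'
}
```

For example, piping lspci into count_efa_devices should print 4 on a p4d instance with all EFA devices attached, and running uses_gdrdma on a saved NCCL_DEBUG=INFO log indicates whether the GDRDMA path was selected.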
from aws-ofi-nccl.
You must also pass the -g option to efa_installer.sh on your host instance.
Hi @leezu
I didn't install EFA from scratch; it came preinstalled with the Deep Learning AMI (Ubuntu 18.04, version 43.0).
Do you mean I need to reinstall it?
Hi @wzamazon
I have passed the FI_EFA_USE_DEVICE_RDMA=1 flag to the mpirun command. Here is the launch script:
/opt/amazon/openmpi/bin/mpirun \
-n ${NUM_PROCS} -H ${HOSTS} \
-x RDMAV_FORK_SAFE=1 -x NCCL_DEBUG=info \
-x FI_EFA_USE_DEVICE_RDMA=1 \
--mca pml ^cm --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1 -c 1 -n 20
Here is the lspci output; 4 EFA NICs are attached. I have also verified this with the fi_info -p efa command.
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
10:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
10:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
10:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
10:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
10:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:01.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
20:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
20:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
20:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
20:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
80:1a.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1b.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1c.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1d.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1e.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
80:1f.0 Bridge: NVIDIA Corporation Device 1af1 (rev a1)
90:02.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
90:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
90:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
90:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
90:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:03.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
a0:1b.0 Ethernet controller: Amazon.com, Inc. Device efa0
a0:1c.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1d.0 3D controller: NVIDIA Corporation Device 20b0 (rev a1)
a0:1e.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
a0:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
fi_info -p efa log:
provider: efa
fabric: EFA-fe80::c8:f6ff:fe4d:1df3
domain: rdmap16s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::cc:b9ff:fe43:e655
domain: rdmap32s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::7a:70ff:feea:56bb
domain: rdmap144s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::b5:58ff:fe4c:a9f
domain: rdmap160s27-rdm
version: 111.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::c8:f6ff:fe4d:1df3
domain: rdmap16s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::cc:b9ff:fe43:e655
domain: rdmap32s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::7a:70ff:feea:56bb
domain: rdmap144s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::b5:58ff:fe4c:a9f
domain: rdmap160s27-dgrm
version: 111.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
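A minimal sketch for checking output like the above without counting by eye. The helper name count_efa_rdm_domains is hypothetical, and the sketch assumes each EFA device exposes exactly one RDM domain whose name ends in -rdm, as in the listing above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count distinct RDM domains in `fi_info -p efa` output
# read from stdin. Assumption: one "<device>-rdm" domain per EFA device.
count_efa_rdm_domains() {
    awk '$1 == "domain:" && $2 ~ /-rdm$/ { seen[$2] = 1 }
         END { n = 0; for (d in seen) n++; print n }'
}
```

Piping fi_info -p efa into count_efa_rdm_domains would print 4 for the listing in this comment.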
What version of aws-ofi-nccl plugin are you using?
v1.1.2, compiled with the following commands:
./autogen.sh --prefix=/usr/local --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-nccl=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
./configure --prefix=/usr/local --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-nccl=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
make
sudo make install
Check your LD_LIBRARY_PATH; it may be that the aws-ofi-nccl plugin you compiled was not picked up.
According to https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-start-nccl-dlami.html, on the p4d platform there should be a line like this in the log:
ip-192-168-2-54:14:14 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/ec2-user/install/plugin/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
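The two checks in this comment could be scripted roughly as follows. Both function names are hypothetical, and the sketch assumes the plugin is installed under the file name libnccl-net.so. It only inspects the given search path; the dynamic loader also consults its cache and default directories, so a hit here is a hint rather than proof of which library NCCL loads.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for the two checks described above.

# Print the first libnccl-net.so found in a colon-separated search path
# (defaults to LD_LIBRARY_PATH); return 1 if none is found.
find_nccl_net_plugin() {
    local search_path="${1:-${LD_LIBRARY_PATH:-}}"
    local dir
    local -a dirs
    IFS=':' read -r -a dirs <<< "$search_path"
    for dir in "${dirs[@]}"; do
        if [ -n "$dir" ] && [ -e "$dir/libnccl-net.so" ]; then
            printf '%s\n' "$dir/libnccl-net.so"
            return 0
        fi
    done
    return 1
}

# Succeed if an NCCL debug log on stdin contains the P4d detection line.
plugin_detected_p4d() {
    grep -q 'NET/OFI Running on P4d platform'
}
```

Calling find_nccl_net_plugin with no argument shows which directory in the current LD_LIBRARY_PATH would supply the plugin first, which helps when both a self-compiled and a pre-installed copy exist.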
LD_LIBRARY_PATH includes /usr/local/lib, where I installed the plugin.
Where can I find p4d-24xl-topo.xml? I can try to pass the file to NCCL manually.
p4d-24xl-topo.xml should be part of the aws-ofi-nccl plugin.
I didn't find the file in the source code. However, the bandwidth issue is solved when I use the pre-compiled aws-ofi-nccl in the /usr/local/cuda-11.0/efa/ folder.
Here is the command:
/opt/amazon/openmpi/bin/mpirun \
-n ${NUM_PROCS} -H ${HOSTS} \
-x FI_EFA_USE_DEVICE_RDMA=1 -x RDMAV_FORK_SAFE=1 --mca pml ^cm \
-x LD_LIBRARY_PATH=/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
-x FI_PROVIDER="efa" --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
$HOME/nccl-tests/build/all_reduce_perf -b 8 -e 512M -f 2 -g 1 -c 1 -n 20
So the plugin I compiled is possibly missing some configuration.
The file is only present on the aws branch (https://github.com/aws/aws-ofi-nccl/blob/aws/topology/p4d-24xl-topo.xml). Is it possible that code from the wrong branch was compiled?
What is the difference between the master branch and the aws branch? I thought this plugin was only for the AWS platform.
Compiling the source code from the aws branch solves the issue.
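A small pre-build check along these lines can catch the wrong-branch case. This is only a sketch: has_p4d_topology is a hypothetical helper, and it assumes the topology file sits at topology/p4d-24xl-topo.xml inside the checkout, as in the URL linked earlier in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if a source tree contains the P4d topology
# file. Assumption: the file lives at topology/p4d-24xl-topo.xml.
has_p4d_topology() {
    local src_dir="${1:-.}"
    [ -f "$src_dir/topology/p4d-24xl-topo.xml" ]
}

# Example usage (not run here):
#   git clone -b aws https://github.com/aws/aws-ofi-nccl.git
#   has_p4d_topology aws-ofi-nccl || echo "topology file missing: wrong branch?"
```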
I followed up on your question about the difference between the master and aws branches. The long-term goal of this project is to make a plugin that can be used to connect NCCL to any libfabric provider. Eventually the plan is to merge the AWS-specific branch into master and discontinue it (but there is no timeline for this at present).