aws / aws-ofi-nccl
This is a plugin which lets EC2 developers use libfabric as a network provider while running NCCL applications.
License: Apache License 2.0
Hi there,
While benchmarking distributed training of Metaseq OPT across multiple p4d.24xlarge instances, we discovered an issue where the training processes launched by Slurm via the "opt-baselines" launcher run into "OSError: [Errno 12] Cannot allocate memory" in the PyTorch DataLoader.
Traceback (most recent call last):
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 793, in <module>
cli_main()
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 789, in cli_main
distributed_utils.call_main(cfg, main)
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 289, in call_main
return distributed_main(
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 227, in distributed_main
retval = main(cfg, **kwargs)
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 190, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 339, in train
samples = next(progress_iter)
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/logging/progress_bar/json_progress_bar.py", line 38, in __iter__
for i, obj in enumerate(self.iterable, start=self.n):
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 62, in __iter__
for x in self.iterable:
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 851, in __next__
raise item
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 782, in run
for item in self._source:
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
return self._get_iterator()
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
w.start()
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
After debugging, we found two ways to avoid the above error:
1: unset FI_EFA_USE_DEVICE_RDMA before launching training
2: adjust --num-worker from the default of 8 down to 0, 1, or 2
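For context on workaround 2: PyTorch's DataLoader creates its workers with fork() on Linux, so every extra worker duplicates (copy-on-write) the parent's committed address space, and a large training process with 8 workers can trip the kernel's memory-overcommit accounting. A minimal stdlib sketch of the same pattern (no PyTorch involved; the worker function and counts are purely illustrative):

```python
import multiprocessing as mp

def worker(q, i):
    # Each worker is a fork()ed copy of the parent on Linux, so the
    # parent's whole address space is duplicated (copy-on-write) per worker.
    q.put(i * i)

ctx = mp.get_context("fork")  # Linux default; mirrors DataLoader's behavior
queue = ctx.Queue()
num_workers = 2  # fewer workers => fewer fork()ed copies => less commit charge
procs = [ctx.Process(target=worker, args=(queue, i)) for i in range(num_workers)]
for p in procs:
    p.start()
results = sorted(queue.get() for _ in procs)
for p in procs:
    p.join()
print(results)
```

Dropping num_workers in the real launcher reduces the number of simultaneous fork() calls the same way.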
These workarounds make us believe it might be the same issue as #69 .
PyTorch: 1.13.1
NVIDIA Driver: 525.85.12
CUDA: 11.7
NCCL: 2.16.2 inc_nsteps
EFA Installer: 1.21.0
AWS OFI NCCL: 1.5.0-aws
Running PyTorch through mpirun over an EFA network, I see the following error:
log
Unable to write to EQ: Missing or unavailable event queue. err: Input/output error (5) prov_errno: Unknown error -114 (-114) prov/rxr/src/rxr.h:1042
This is connected to EFA, since changing FI_PROVIDER to something besides EFA removes the error.
This error is also related to PyTorch DataLoaders using the fork strategy to create additional processes under mpirun. Changing the strategy to spawn allows training to proceed. Unfortunately, there is a bug on the PyTorch side which limits the number of GPUs that can be used with spawn.
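The fork-vs-spawn switch described above maps onto Python's standard start-method API. The snippet below uses only the stdlib to show the two contexts side by side; where this gets wired into a DataLoader (e.g. via its multiprocessing_context argument) is stated here as an assumption about the integration point, not a verified fix:

```python
import multiprocessing as mp

# spawn starts a fresh interpreter per worker instead of fork()ing the
# (possibly EFA-initialized) parent, so no libfabric state is inherited.
spawn_ctx = mp.get_context("spawn")
fork_ctx = mp.get_context("fork")   # Unix-only; the problematic default here

print(spawn_ctx.get_start_method(), fork_ctx.get_start_method())
```

The trade-off is that spawn re-imports the main module in every worker, which is slower to start and is where the PyTorch-side GPU-count bug mentioned above comes into play.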
I am wondering if you all have plans for testing the plugin with RHEL 9, or if you will eventually claim support for RHEL 9.
I've done some preliminary testing with the OPX libfabric provider on a RHEL 9 system with NVIDIA A40 GPUs and have had some success running the unit tests, functional tests, and NVIDIA's nccl-tests.
Hi there,
I have installed the plugin with nccl-v2.7.8 and CUDA 11. Running nccl-tests didn't give reasonable results (we expected to see bus bandwidth over 40 GB/s). One thing I noticed is that all NCCL channels go via NET/AWS Libfabric/0.
Here is the full log with output from NCCL debug info.
Any suggestions for configuring the plugin or NCCL?
# nThread 1 nGpus 1 minBytes 8 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 validation: 1
#
# Using devices
# Rank 0 Pid 9738 on ip-172-31-13-103 device 0 [0x10] A100-SXM4-40GB
# Rank 1 Pid 9739 on ip-172-31-13-103 device 1 [0x10] A100-SXM4-40GB
# Rank 2 Pid 9740 on ip-172-31-13-103 device 2 [0x20] A100-SXM4-40GB
# Rank 3 Pid 9741 on ip-172-31-13-103 device 3 [0x20] A100-SXM4-40GB
# Rank 4 Pid 9742 on ip-172-31-13-103 device 4 [0x90] A100-SXM4-40GB
# Rank 5 Pid 9743 on ip-172-31-13-103 device 5 [0x90] A100-SXM4-40GB
# Rank 6 Pid 9744 on ip-172-31-13-103 device 6 [0xa0] A100-SXM4-40GB
# Rank 7 Pid 9748 on ip-172-31-13-103 device 7 [0xa0] A100-SXM4-40GB
# Rank 8 Pid 9921 on ip-172-31-6-104 device 0 [0x10] A100-SXM4-40GB
# Rank 9 Pid 9922 on ip-172-31-6-104 device 1 [0x10] A100-SXM4-40GB
# Rank 10 Pid 9923 on ip-172-31-6-104 device 2 [0x20] A100-SXM4-40GB
# Rank 11 Pid 9924 on ip-172-31-6-104 device 3 [0x20] A100-SXM4-40GB
# Rank 12 Pid 9925 on ip-172-31-6-104 device 4 [0x90] A100-SXM4-40GB
# Rank 13 Pid 9926 on ip-172-31-6-104 device 5 [0x90] A100-SXM4-40GB
# Rank 14 Pid 9927 on ip-172-31-6-104 device 6 [0xa0] A100-SXM4-40GB
# Rank 15 Pid 9928 on ip-172-31-6-104 device 7 [0xa0] A100-SXM4-40GB
ip-172-31-13-103:9738:9738 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9738:9738 [0] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9922:9922 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9923:9923 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9926:9926 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9924:9924 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9921:9921 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9927:9927 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9925:9925 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9928:9928 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9922:9922 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9923:9923 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9926:9926 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9924:9924 [3] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9928:9928 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9925:9925 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9927:9927 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9921:9921 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-13-103:9744:9744 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9744:9744 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9748:9748 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9740:9740 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9739:9739 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9741:9741 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9743:9743 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9742:9742 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9740:9740 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9748:9748 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9739:9739 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9741:9741 [3] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9800 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9738:9797 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9923:9981 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9927:9985 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9921:9986 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9739:9804 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9744:9798 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9748:9801 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9741:9803 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9743:9799 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9740:9802 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9928:9984 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9926:9982 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9925:9987 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9924:9983 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-6-104:9922:9980 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
ip-172-31-13-103:9738:9797 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9738:9797 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->9|9->0->1/-1/-1
ip-172-31-13-103:9738:9797 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-6-104:9925:9987 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9925:9987 [4] NCCL INFO Trees [0] 13/-1/-1->12->11|11->12->13/-1/-1 [1] 13/-1/-1->12->11|11->12->13/-1/-1
ip-172-31-6-104:9926:9982 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9926:9982 [5] NCCL INFO Trees [0] 14/-1/-1->13->12|12->13->14/-1/-1 [1] 14/-1/-1->13->12|12->13->14/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9739:9804 [1] NCCL INFO Trees [0] 2/8/-1->1->0|0->1->2/8/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9924:9983 [3] NCCL INFO Trees [0] 12/-1/-1->11->10|10->11->12/-1/-1 [1] 12/-1/-1->11->10|10->11->12/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9923:9981 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9923:9981 [2] NCCL INFO Trees [0] 11/-1/-1->10->9|9->10->11/-1/-1 [1] 11/-1/-1->10->9|9->10->11/-1/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9922:9980 [1] NCCL INFO Trees [0] 10/-1/-1->9->8|8->9->10/-1/-1 [1] 10/0/-1->9->8|8->9->10/0/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9928:9984 [7] NCCL INFO Trees [0] -1/-1/-1->15->14|14->15->-1/-1/-1 [1] -1/-1/-1->15->14|14->15->-1/-1/-1
ip-172-31-6-104:9923:9981 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-6-104:9927:9985 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9927:9985 [6] NCCL INFO Trees [0] 15/-1/-1->14->13|13->14->15/-1/-1 [1] 15/-1/-1->14->13|13->14->15/-1/-1
ip-172-31-6-104:9927:9985 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9925:9987 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9740:9802 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9740:9802 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-13-103:9740:9802 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-13-103:9741:9803 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9741:9803 [3] NCCL INFO Trees [0] 4/-1/-1->3->2|2->3->4/-1/-1 [1] 4/-1/-1->3->2|2->3->4/-1/-1
ip-172-31-13-103:9741:9803 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9926:9982 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9742:9800 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9742:9800 [4] NCCL INFO Trees [0] 5/-1/-1->4->3|3->4->5/-1/-1 [1] 5/-1/-1->4->3|3->4->5/-1/-1
ip-172-31-13-103:9742:9800 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9743:9799 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9743:9799 [5] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/-1/-1->5->4|4->5->6/-1/-1
ip-172-31-13-103:9743:9799 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9744:9798 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9744:9798 [6] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
ip-172-31-13-103:9744:9798 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9748:9801 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9748:9801 [7] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
ip-172-31-13-103:9748:9801 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9921:9986 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9921:9986 [0] NCCL INFO Trees [0] 9/-1/-1->8->1|1->8->9/-1/-1 [1] 9/-1/-1->8->-1|-1->8->9/-1/-1
ip-172-31-6-104:9921:9986 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9742:9800 [4] NCCL INFO comm 0x7f2ac8000dc0 rank 4 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9925:9987 [4] NCCL INFO comm 0x7f8da8000dc0 rank 12 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9743:9799 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9743:9799 [5] NCCL INFO comm 0x7f73a4000dc0 rank 5 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-6-104:9926:9982 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9926:9982 [5] NCCL INFO comm 0x7f4328000dc0 rank 13 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9928:9984 [7] NCCL INFO comm 0x7f3a98000dc0 rank 15 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9927:9985 [6] NCCL INFO comm 0x7fa984000dc0 rank 14 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9924:9983 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9924:9983 [3] NCCL INFO comm 0x7f3bb8000dc0 rank 11 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO comm 0x7f6c4c000dc0 rank 3 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9923:9981 [2] NCCL INFO comm 0x7f5a24000dc0 rank 10 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9740:9802 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9740:9802 [2] NCCL INFO comm 0x7f36c8000dc0 rank 2 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9739:9804 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO comm 0x7fed38000dc0 rank 1 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9748:9801 [7] NCCL INFO comm 0x7fbd04000dc0 rank 7 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9744:9798 [6] NCCL INFO comm 0x7f23b8000dc0 rank 6 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9921:9986 [0] NCCL INFO comm 0x7fdc3c000dc0 rank 8 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9922:9980 [1] NCCL INFO comm 0x7f17c8000dc0 rank 9 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9738:9797 [0] NCCL INFO comm 0x7faad8000dc0 rank 0 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-13-103:9738:9738 [0] NCCL INFO Launch mode Parallel
8 2 float sum 79.04 0.00 0.00 4e-07 64.55 0.00 0.00 4e-07
16 4 float sum 63.60 0.00 0.00 4e-07 63.44 0.00 0.00 2e-07
32 8 float sum 64.62 0.00 0.00 2e-07 65.10 0.00 0.00 1e-07
64 16 float sum 65.01 0.00 0.00 1e-07 64.24 0.00 0.00 1e-07
128 32 float sum 65.24 0.00 0.00 1e-07 64.54 0.00 0.00 1e-07
256 64 float sum 65.57 0.00 0.01 1e-07 64.45 0.00 0.01 1e-07
512 128 float sum 66.94 0.01 0.01 1e-07 66.50 0.01 0.01 1e-07
1024 256 float sum 68.47 0.01 0.03 4e-07 68.17 0.02 0.03 4e-07
2048 512 float sum 72.68 0.03 0.05 4e-07 72.91 0.03 0.05 4e-07
4096 1024 float sum 78.42 0.05 0.10 4e-07 77.77 0.05 0.10 4e-07
8192 2048 float sum 83.65 0.10 0.18 4e-07 81.94 0.10 0.19 4e-07
16384 4096 float sum 96.39 0.17 0.32 4e-07 93.38 0.18 0.33 4e-07
32768 8192 float sum 116.7 0.28 0.53 4e-07 114.8 0.29 0.54 4e-07
65536 16384 float sum 155.3 0.42 0.79 4e-07 153.4 0.43 0.80 4e-07
131072 32768 float sum 203.0 0.65 1.21 4e-07 203.7 0.64 1.21 4e-07
262144 65536 float sum 315.6 0.83 1.56 4e-07 311.2 0.84 1.58 4e-07
524288 131072 float sum 409.9 1.28 2.40 4e-07 407.1 1.29 2.41 4e-07
1048576 262144 float sum 597.2 1.76 3.29 4e-07 594.6 1.76 3.31 4e-07
2097152 524288 float sum 926.9 2.26 4.24 4e-07 924.9 2.27 4.25 4e-07
4194304 1048576 float sum 1583.5 2.65 4.97 4e-07 1584.1 2.65 4.96 4e-07
8388608 2097152 float sum 2939.5 2.85 5.35 4e-07 2929.9 2.86 5.37 4e-07
16777216 4194304 float sum 5366.1 3.13 5.86 4e-07 5381.9 3.12 5.85 4e-07
33554432 8388608 float sum 10305 3.26 6.10 4e-07 10294 3.26 6.11 4e-07
67108864 16777216 float sum 20358 3.30 6.18 4e-07 20341 3.30 6.19 4e-07
134217728 33554432 float sum 39328 3.41 6.40 4e-07 39392 3.41 6.39 4e-07
268435456 67108864 float sum 77210 3.48 6.52 4e-07 77304 3.47 6.51 4e-07
536870912 134217728 float sum 152989 3.51 6.58 4e-07 152798 3.51 6.59 4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 2.32362
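Not an authoritative fix, but when debugging numbers like the ones above, the first things worth checking are the environment knobs below. Whether they help depends on instance type and software versions, so treat this as a checklist sketch rather than a known-good configuration:

```shell
# Illustrative environment checklist (values are assumptions to verify
# against your EFA/NCCL versions, not a recommended configuration).
export FI_PROVIDER=efa          # make sure libfabric actually picks EFA
export NCCL_DEBUG=INFO          # confirm "Selected Provider is efa" at init
export NCCL_ALGO=Ring           # pin the algorithm while comparing runs
echo "FI_PROVIDER=$FI_PROVIDER NCCL_DEBUG=$NCCL_DEBUG NCCL_ALGO=$NCCL_ALGO"
```

With NCCL_DEBUG=INFO set (as in the log above), the init lines confirm which provider and network NCCL actually selected before any bandwidth numbers are compared.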
Hi. I'm a code monkey on the PSM3 team, and I'm investigating an internal problem report, trying to understand whether it's a real issue or not. It relates to ofi_iflush() issuing an RDMA read to, presumably, somehow force all outstanding I/Os to complete. It seems pretty obvious that the data being read is not itself important, and I'm wondering: how does adding another I/O to the queue guarantee that all outstanding I/Os complete?
Could you shed some light on this for me, please? Does NCCL assume that when the RDMA read completes, all prior I/Os have also completed?
Hello aws_ofi_nccl maintainers,
I noticed that the PSM3 provider was removed from the libfabric build in your Travis config file, probably due to #85.
As that issue has already been resolved, I guess this is no longer needed. Could you please revisit this and re-enable PSM3 if possible?
BRs,
Denis.
Do we have any plans on supporting Ubuntu 22.04?
I'm attempting to assemble a single docker image to support both EFA and mellanox as we split workloads between different clouds, and it's easy to use the wrong image on the wrong cloud. I currently have something like this:
#####################################
# Install EFA and AWS-OFI-NCCL plugin
#####################################
ARG EFA_INSTALLER_VERSION=latest
ARG AWS_OFI_NCCL_VERSION=v1.5.0-aws
ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH
RUN if [ -n "$CUDA_VERSION" ] ; then \
cd /tmp && \
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
cd aws-efa-installer && \
apt-get update && \
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
rm -rf /tmp/aws-efa-installer* ; \
fi
RUN if [ -n "$CUDA_VERSION" ] ; then \
git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \
cd /opt/aws-ofi-nccl && \
git checkout ${AWS_OFI_NCCL_VERSION} && \
./autogen.sh && \
./configure --prefix=/opt/aws-ofi-nccl/install \
--with-libfabric=/opt/amazon/efa/ \
--with-cuda=/usr/local/cuda \
--disable-tests && \
make && make install ; \
fi
###################################
# Mellanox OFED driver installation
###################################
ARG MOFED_VERSION
RUN if [ -n "$MOFED_VERSION" ] ; then \
mkdir -p /tmp/mofed && \
wget -nv -P /tmp/mofed http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64.tgz && \
tar -zxvf /tmp/mofed/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64.tgz -C /tmp/mofed && \
/tmp/mofed/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force && \
rm -rf /tmp/mofed ; \
fi
and I either comment out the Mellanox part or the EFA part depending on which image I want to build. When attempting to build both at the same time, it appears the Mellanox installation wipes out something from EFA, resulting in EFA not working. If I install EFA after Mellanox, I encounter the following error:
The following packages have unmet dependencies:
libibmad5-dbg : Depends: libibmad5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
libibnetdisc-dev : Depends: libibnetdisc5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
libibnetdisc5-dbg : Depends: libibnetdisc5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
libibumad3-dbg : Depends: libibumad3 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
librdmacm1-dbg : Depends: librdmacm1 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
I would love to get some insight into
Hello aws-ofi-nccl maintainers,
During our testing with the PSM3 provider, we hit an abnormal program termination in the ofi_iflush phase.
The assumption that the provider requires memory registration was reintroduced by recent changes.
Could you please consider merging #97, which fixes this?
I also wanted to discuss the following topic: how do we get PSM3 added to the continuous integration plan for the aws-ofi-nccl plugin, so that mistakes like this are detected sooner?
I've observed that on p3dn.24xlarge instances, multi-node PyTorch training jobs using EFA and aws-ofi-nccl have worse performance on AL2 compared to an equivalent setup on Ubuntu 20.04.
AMI | EFA Enabled | Throughput (wpm) |
---|---|---|
AL2 Deep Learning Base AMI | Yes | ~65000 |
AL2 Deep Learning Base AMI | No | ~45000 |
Ubuntu Deep Learning Base AMI | Yes | ~120000 |
Ubuntu Deep Learning Base AMI | No | ~110000 |
The reason I don't have numbers for Ubuntu with EFA is because of #107.
On both AL2 and Ubuntu 20.04 instances, we're using p3dn.24xlarge in the same VPC, in the same placement group (cluster strategy). EFA is enabled on the network interfaces, and I've verified that EFA drivers are installed.
Dockerfile used for training job:
fairseq_train_wrapped (built into the Dockerfile, a workaround from facebookresearch/fairseq#4302 to set the rank):
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os
import sys
from fairseq_cli.train import cli_main

if __name__ == "__main__":
    sys.argv += ["--local_rank", os.getenv("LOCAL_RANK")]
    sys.exit(cli_main())
CUDA version: 11.3
Training Data download and pre-processing following https://github.com/pytorch/fairseq/blob/main/examples/language_model/README.md
Two node setup used for training job.
AMI: Deep Learning Base AMI (Amazon Linux 2) Version 52.0 (ami-07f6f7cc742921659 in us-west-2)
Nvidia driver version: 510.47.03
CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)
/opt/amazon/efa/bin/fi_info -p efa:
provider: efa
fabric: EFA-fe80::58:92ff:fe3f:a7a3
domain: rdmap0s6-rdm
version: 114.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::58:92ff:fe3f:a7a3
domain: rdmap0s6-dgrm
version: 114.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
nvidia-docker version: 20.10.7
Training command (EFA enabled) (executed on both training nodes):
nvidia-docker run \
--mount type=bind,src=/mnt/fsx,dst=/job \
--network host \
--device /dev/infiniband/uverbs0 \
--env NCCL_SOCKET_IFNAME=eth0 \
--env FI_PROVIDER=efa \
--env LOGLEVEL=INFO \
--env NCCL_PROTO=simple \
--env NCCL_DEBUG=INFO \
919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
fairseq_train_wrapped \
--task language_modeling \
/job/fairseq/data-bin/wikitext-103 \
--save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2" \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
--tokens-per-sample 512 --sample-break-mode none \
--max-tokens 2048 \
--max-update 50000 2>&1 | tee ~/output_al2_efa.txt
Note that we're using the --device /dev/infiniband/uverbs0 flag to pass through EFA. For the non-EFA run, we drop the arguments --device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa.
AMI: Ubuntu Deep Learning Base AMI (ami-061dac75dbd529aef in us-west-2)
Nvidia driver version: 510.47.03
CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)
/opt/amazon/efa/bin/fi_info -p efa:
provider: efa
fabric: EFA-fe80::da:b9ff:fe04:8af
domain: rdmap0s6-rdm
version: 114.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::da:b9ff:fe04:8af
domain: rdmap0s6-dgrm
version: 114.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
nvidia-docker version: 20.10.14
Training command (EFA enabled) (executed on both training nodes):
nvidia-docker run \
--mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
--network host \
--device /dev/infiniband/uverbs0 \
--env FI_PROVIDER=EFA \
--env NCCL_SOCKET_IFNAME=ens5 \
--env LOGLEVEL=INFO \
--env NCCL_PROTO=simple \
--env NCCL_DEBUG=INFO \
--ulimit memlock=-1 \
919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
--master_addr=$MASTER_IP --master_port=12345 \
fairseq_train_wrapped \
--task language_modeling \
/job/fairseq/data-bin/wikitext-103 \
--save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa" \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
--tokens-per-sample 512 --sample-break-mode none \
--max-tokens 2048 \
--max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt
Compared to the AL2 run, we add --ulimit memlock=-1 due to #107. Note that adding the same flag to the equivalent AL2 run makes no difference in performance. For the non-EFA run, we drop the arguments --device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa.
Interesting bits
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3
Full log: https://gist.github.com/yukunlin/634c600a11e36d1384215ab08366e774
Initialization:
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/IB : No device found.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Using network Socket
[0]:NCCL version 2.10.3+cuda11.3
Full log: https://gist.github.com/yukunlin/8c4298450299b33dd9a4c0559f50eccc
Initialization:
ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
Full logs: https://gist.github.com/yukunlin/ba8e41131abc1a7e4fb288b480d94b8f
Initialization:
ip-10-0-0-115:73:73 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:73:73 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
ip-10-0-0-115:73:73 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:73:73 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3
Full log: https://gist.github.com/yukunlin/95a1036dba1c3a677f8f130e6cf23fbf
p4d.24xlarge instances support up to four EFAs (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-amis).
I added 2 EFA network interfaces on a p4d instance, and it always reports an error.
Moreover, we can't use public IPs, because public IPs can only be assigned to instances with one network interface.
2020-12-11 17:14:32,501 f5db57be Thread-28 nccl(1/1) vm_util.py:407 DEBUG Ran: {aws --output json ec2 run-instances --region=us-east-1 --client-token=72aa7d99-280b-4949-b19c-3ab70a8ad6cd --image-id=ami-0404ddec9491a5a31 --instance-type=p4d.24xlarge --key-name=perfkit-key-f5db57be --tag-specifications=ResourceType=instance,Tags=[{Key=timeout_utc,Value=20201212t021429z},{Key=create_time_utc,Value=20201212t011429z},{Key=benchmark,Value=nccl},{Key=perfkit_uuid,Value=f5db57be-26112013-d3c7-4c94-81c7-1e0d55d1f34a},{Key=owner,Value=tohaowu},{Key=benchmark_uid,Value=nccl0}] --network-interfaces=[{"InterfaceType": "efa", "DeviceIndex": 0, "SubnetId": "subnet-0911d87f03ca7d3e5", "Groups": ["sg-053c797b3233f857f"]}, {"InterfaceType": "efa", "DeviceIndex": 1, "SubnetId": "subnet-02ad934424869f529", "Groups": ["sg-053c797b3233f857f"]}] --block-device-mappings=[{"DeviceName": "/dev/sda1", "Ebs": {"DeleteOnTermination": true, "SnapshotId": "snap-003d530d53f605cf3", "VolumeSize": 105, "VolumeType": "gp2"}}] --placement=AvailabilityZone=us-east-1a,GroupName=perfkit-f5db57be-79b7ca8c4588}
ReturnCode:0, WallTime:0:02.65s, CPU:0.39s, MaxMemory:57420kb
STDOUT: {
"Groups": [],
"Instances": [
{
"AmiLaunchIndex": 0,
"ImageId": "ami-0404ddec9491a5a31",
"InstanceId": "i-08f87c8e0c176485d",
"InstanceType": "p4d.24xlarge",
"KeyName": "perfkit-key-f5db57be",
"LaunchTime": "2020-12-12T01:14:32.000Z",
"Monitoring": {
"State": "disabled"
},
"Placement": {
"AvailabilityZone": "us-east-1a",
"GroupName": "",
"Tenancy": "default"
},
"PrivateDnsName": "ip-10-0-0-71.ec2.internal",
"PrivateIpAddress": "10.0.0.71",
"ProductCodes": [],
"PublicDnsName": "",
"State": {
"Code": 0,
"Name": "pending"
},
"StateTransitionReason": "",
"SubnetId": "subnet-0911d87f03ca7d3e5",
"VpcId": "vpc-0e77f7b3dc0a69e21",
"Architecture": "x86_64",
"BlockDeviceMappings": [],
"ClientToken": "72aa7d99-280b-4949-b19c-3ab70a8ad6cd",
"EbsOptimized": false,
"EnaSupport": true,
"Hypervisor": "xen",
"NetworkInterfaces": [
{
"Attachment": {
"AttachTime": "2020-12-12T01:14:32.000Z",
"AttachmentId": "eni-attach-01ed94ae6b3920106",
"DeleteOnTermination": true,
"DeviceIndex": 1,
"Status": "attaching"
},
"Description": "",
"Groups": [
{
"GroupName": "default",
"GroupId": "sg-053c797b3233f857f"
}
],
"Ipv6Addresses": [],
"MacAddress": "0e:f1:b3:8c:2a:01",
"NetworkInterfaceId": "eni-08e48fabb32dd8641",
"OwnerId": "835761027970",
"PrivateDnsName": "ip-10-0-1-81.ec2.internal",
"PrivateIpAddress": "10.0.1.81",
"PrivateIpAddresses": [
{
"Primary": true,
"PrivateDnsName": "ip-10-0-1-81.ec2.internal",
"PrivateIpAddress": "10.0.1.81"
}
],
"SourceDestCheck": true,
"Status": "in-use",
"SubnetId": "subnet-02ad934424869f529",
"VpcId": "vpc-0e77f7b3dc0a69e21",
"InterfaceType": "efa"
},
{
"Attachment": {
"AttachTime": "2020-12-12T01:14:32.000Z",
"AttachmentId": "eni-attach-091873659b0d0ccf3",
"DeleteOnTermination": true,
"DeviceIndex": 0,
"Status": "attaching"
},
"Description": "",
"Groups": [
{
"GroupName": "default",
"GroupId": "sg-053c797b3233f857f"
}
],
"Ipv6Addresses": [],
"MacAddress": "0e:ee:ef:8d:4f:a7",
"NetworkInterfaceId": "eni-08536700350aed736",
"OwnerId": "835761027970",
"PrivateDnsName": "ip-10-0-0-71.ec2.internal",
"PrivateIpAddress": "10.0.0.71",
"PrivateIpAddresses": [
{
"Primary": true,
"PrivateDnsName": "ip-10-0-0-71.ec2.internal",
"PrivateIpAddress": "10.0.0.71"
}
],
"SourceDestCheck": true,
"Status": "in-use",
"SubnetId": "subnet-0911d87f03ca7d3e5",
"VpcId": "vpc-0e77f7b3dc0a69e21",
"InterfaceType": "efa"
}
],
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SecurityGroups": [
{
"GroupName": "default",
"GroupId": "sg-053c797b3233f857f"
}
],
"SourceDestCheck": true,
"StateReason": {
"Code": "pending",
"Message": "pending"
},
"Tags": [
{
"Key": "owner",
"Value": "tohaowu"
},
{
"Key": "benchmark_uid",
"Value": "nccl0"
},
{
"Key": "perfkit_uuid",
"Value": "f5db57be-26112013-d3c7-4c94-81c7-1e0d55d1f34a"
},
{
"Key": "timeout_utc",
"Value": "20201212t021429z"
},
{
"Key": "benchmark",
"Value": "nccl"
},
{
"Key": "create_time_utc",
"Value": "20201212t011429z"
}
],
"VirtualizationType": "hvm",
"CpuOptions": {
"CoreCount": 48,
"ThreadsPerCore": 2
},
"CapacityReservationSpecification": {
"CapacityReservationPreference": "open"
},
"MetadataOptions": {
"State": "pending",
"HttpTokens": "optional",
"HttpPutResponseHopLimit": 1,
"HttpEndpoint": "enabled"
}
}
],
"OwnerId": "835761027970",
"ReservationId": "r-098d70a7ef009aaf4"
}
STDERR:
2020-12-11 17:14:32,502 f5db57be Thread-28 nccl(1/1) vm_util.py:353 INFO Running: aws --output json ec2 describe-instances --region=us-east-1 --filter=Name=client-token,Values=72aa7d99-280b-4949-b19c-3ab70a8ad6cd
2020-12-11 17:14:33,394 f5db57be Thread-28 nccl(1/1) vm_util.py:407 DEBUG Ran: {aws --output json ec2 describe-instances --region=us-east-1 --filter=Name=client-token,Values=72aa7d99-280b-4949-b19c-3ab70a8ad6cd}
ReturnCode:0, WallTime:0:00.87s, CPU:0.40s, MaxMemory:56720kb
STDOUT: {
"Reservations": [
{
"Groups": [],
"Instances": [
{
"AmiLaunchIndex": 0,
"ImageId": "ami-0404ddec9491a5a31",
"InstanceId": "i-08f87c8e0c176485d",
"InstanceType": "p4d.24xlarge",
"KeyName": "perfkit-key-f5db57be",
"LaunchTime": "2020-12-12T01:14:32.000Z",
"Monitoring": {
"State": "disabled"
},
"Placement": {
"AvailabilityZone": "us-east-1a",
"GroupName": "",
"Tenancy": "default"
},
"PrivateDnsName": "",
"ProductCodes": [],
"PublicDnsName": "",
"State": {
"Code": 32,
"Name": "shutting-down"
},
"StateTransitionReason": "Server.InternalError",
"Architecture": "x86_64",
"BlockDeviceMappings": [],
"ClientToken": "72aa7d99-280b-4949-b19c-3ab70a8ad6cd",
"EbsOptimized": false,
"EnaSupport": true,
"Hypervisor": "xen",
"NetworkInterfaces": [],
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SecurityGroups": [],
"StateReason": {
"Code": "Server.InternalError",
"Message": "Server.InternalError: Internal error on launch"
},
"Tags": [
{
"Key": "owner",
"Value": "tohaowu"
},
{
"Key": "benchmark_uid",
"Value": "nccl0"
},
{
"Key": "perfkit_uuid",
"Value": "f5db57be-26112013-d3c7-4c94-81c7-1e0d55d1f34a"
},
{
"Key": "timeout_utc",
"Value": "20201212t021429z"
},
{
"Key": "benchmark",
"Value": "nccl"
},
{
"Key": "create_time_utc",
"Value": "20201212t011429z"
}
],
"VirtualizationType": "hvm",
"CpuOptions": {
"CoreCount": 48,
"ThreadsPerCore": 2
},
"CapacityReservationSpecification": {
"CapacityReservationPreference": "open"
},
"HibernationOptions": {
"Configured": false
},
"MetadataOptions": {
"State": "pending",
"HttpTokens": "optional",
"HttpPutResponseHopLimit": 1,
"HttpEndpoint": "enabled"
}
}
],
"OwnerId": "835761027970",
"ReservationId": "r-098d70a7ef009aaf4"
}
]
}
If I pass --nccl-path=~/nccl/build, the make command complains that 'nccl_net.h' is not found. I later found that I need to pass the complete path, /home/ubuntu/nccl/build. Automatically recognizing ~ would make installation easier for others.
Hello aws_ofi_nccl maintainers,
For GDR-capable providers which do not request memory registration (i.e. they provide FI_HMEM but not FI_MR_HMEM or FI_MR_LOCAL), there is an issue in the current implementation.
The function register_mr_buffers() leaves mr_handle intact, and since the caller initializes it to NULL, mr_handle stays NULL.
Later, during ofi_iflush(), a NULL mr_handle is treated as an error condition, which leads to returning ncclSystemError. But a NULL handle is normal here if the provider did not ask for memory registration.
Later still, during ofi_closeRecv(), mr_handle is passed to fi_close(); since it is NULL, this leads to a segfault.
Pull request #81 was created to fix this.
Please consider merging it.
BRs,
Denis
I would greatly appreciate some clarity about how FI_OPT_CUDA_API_PERMITTED is used, its relationship to FI_HMEM, and its relationship to GPUDirect/GDRCopy, if any.
Launch Command:
/opt/amazon/openmpi/bin/mpirun -np 2 --host 172.0.1.23,172.0.1.161 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH /usr/local/bin/nccl_message_transfer
Error:
INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa
INFO: Function: main Line: 49: NET/OFI Process rank 0 started. NCCLNet device used on ip-172-0-1-23 is AWS Libfabric.
INFO: Function: main Line: 53: NET/OFI Received 1 network devices
INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0
INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa
INFO: Function: main Line: 49: NET/OFI Process rank 1 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.
INFO: Function: main Line: 53: NET/OFI Received 1 network devices
INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0
WARN: Function: create_nccl_ofi_component Line: 459: NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
WARN: Function: create_nccl_ofi_component Line: 459: NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
MPI hangs with no other output; top shows 100% CPU usage from nccl_message_transfer on both nodes.
Additional errors in dmesg:
[ 1632.420956] infiniband efa_0: Failed to process command ALLOC_PD (opcode 14) comp_status 1 err -12
[ 1632.420957] infiniband efa_0: Failed to allocate pd[-12]
[ 1632.726850] infiniband efa_0: Failed to allocate pd[-12]
Environment:
Ubuntu 18.04
NVIDIA 430.26 CUDA 10.0
EFA 1.5
PATH=/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/cuda:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
LD_LIBRARY_PATH=/opt/nccl/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:
EFA enabled:
ubuntu@ip-172-0-1-23:~$ fi_info -p efa
provider: efa
fabric: EFA-fe80::ca:e4ff:fec9:7534
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::ca:e4ff:fec9:7534
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::ca:e4ff:fec9:7534
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
I was stuck on this problem for many days. May I know how to debug such issues?
Currently, I am using all_to_all NCCL MPI (world_size=64) + EFA. The error is as follows:
NVIDIA/nccl#563 (NCCL engineer is not sure what's the problem and needs EFA's help)
I'm using horovod with EFA, and the multi-node job hangs with
...
[1,26]<stdout>:ip-172-32-36-209:8716:9558 [2] NCCL INFO Ring 00 : 26[2] -> 30[6] via P2P/IPC
[1,25]<stdout>:ip-172-32-36-209:8715:9546 [1] NCCL INFO Ring 00 : 25[1] -> 27[3] via P2P/IPC
[1,29]<stdout>:ip-172-32-36-209:8719:9555 [5] NCCL INFO Ring 00 : 29[5] -> 31[7] via P2P/IPC
[1,28]<stdout>:ip-172-32-36-209:8718:9563 [4] NCCL INFO Ring 00 : 28[4] -> 29[5] via P2P/IPC
[1,27]<stdout>:ip-172-32-36-209:8717:9550 [3] NCCL INFO Ring 00 : 27[3] -> 26[2] via P2P/IPC
[1,7]<stdout>:ip-172-32-38-1:21471:22316 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/AWS Libfabric/0
[1,15]<stdout>:ip-172-32-34-121:70560:71394 [7] NCCL INFO Ring 00 : 15 -> 16 [send] via NET/AWS Libfabric/0
[1,30]<stdout>:ip-172-32-36-209:8720:9567 [6] NCCL INFO Ring 00 : 30[6] -> 28[4] via P2P/IPC
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO Ring 00 : 23 -> 24 [send] via NET/AWS Libfabric/0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO NET/OFI No NIC info for dev 0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO include/net.h:24 -> 2
[1,23]<stdout>:
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO include/net.h:27 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO transport/net.cc:357 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:668 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:814 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:950 -> 2
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO Ring 00 : 31 -> 0 [send] via NET/AWS Libfabric/0
...
ubuntu@ip-172-32-38-1:~$ fi_info -p efa
provider: efa
fabric: EFA-fe80::7e:9eff:fed3:c48a
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::7e:9eff:fed3:c48a
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::7e:9eff:fed3:c48a
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
If this is already a supported OS, please feel free to close
Hey folks, we at Cornelis Networks are interested in adopting this ofi-nccl plugin.
If we create a patch to support fi_context2 structures, would that be something you are willing to entertain?
Thanks!
Hello, I'm trying to test Horovod with EFA + NCCL, but it hangs when testing with multiple nodes. I think the main error is: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory.
[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,5]<stdout>:ip-172-31-6-189:155:696 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO Ring 01 : 3 -> 10 [send] via NET/AWS Libfabric/1
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 01 : 6[6] -> 4[4] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 01 : 7[7] -> 3[3] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 01 : 4[4] -> 7[7] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 02 : 13 -> 4 [receive] via NET/AWS Libfabric/2
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 02 : 6[6] -> 7[7] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 02 : 7[7] -> 5[5] via P2P/IPC
[1,4]<stdout>:
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO include/net.h:21 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO transport/net.cc:334 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:340 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:650 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:815 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:951 -> 2
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 01 : 12[4] -> 15[7] via P2P/IPC
[1,14]<stdout>:ip-172-31-3-127:43:574 [6] NCCL INFO Ring 01 : 14[6] -> 12[4] via P2P/IPC
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO Ring 01 : 11 -> 2 [send] via NET/AWS Libfabric/1
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 02 : 5 -> 12 [receive] via NET/AWS Libfabric/2
[1,13]<stdout>:ip-172-31-3-127:42:576 [5] NCCL INFO Ring 01 : 13[5] -> 14[6] via P2P/IPC
[1,12]<stdout>:
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO include/net.h:21 -> 2
Some information which may be helpful:
I use EFA 1.5.1, and fi_info -p efa works:
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
EFA installer version: 1.5.1
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 libfabric1_1.8.0amzn1.0_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi_3.1.4-2_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-dbg_1.8.0amzn1.0_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64
I also tested NCCL's all_reduce_perf and it works as well. To run:
curl http://169.254.169.254/latest/meta-data/local-ipv4 >> my-hosts && \
/opt/amazon/openmpi/bin/mpirun \
    -x FI_PROVIDER=efa \
    -x FI_EFA_TX_MIN_CREDITS=64 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_TREE_THRESHOLD=0 \
    --hostfile my-hosts -n 8 -N 8 \
    --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
    /opt/build/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100
I get:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_nccl-efa-test.log
For Horovod, my command is:
NCCL_DEBUG=INFO \
HOROVOD_NUM_NCCL_STREAMS=4 \
horovodrun -np 16 -H localhost:8,172.31.3.127:8 \
    --mpi-args="-x PATH -x LD_LIBRARY_PATH -x FI_PROVIDER=efa -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_TREE_THRESHOLD=0" python3 /home/cluster/distributed-training/test_scripts/pytorch_synthetic_benchmark.py --model resnet101 --batch-size 32 |& grep -v "Read -1"
the complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_horovod-test.log
PS: In fact, I would prefer to use EFA 1.8.3 (to keep the same test environment), but I got more errors with that version:
[1,0]<stderr>:terminate called after throwing an instance of 'std::system_error'
[1,0]<stderr>: what(): Resource deadlock avoided
[1,0]<stderr>:[ip-172-31-6-189:00789] *** Process received signal ***
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal: Aborted (6)
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal code: (-6)
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 0] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f17f5b34f20]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f17f5b34e97]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f17f5b36801]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 3] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f17f0d40957]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 4] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ab6)[0x7f17f0d46ab6]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 5] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x91b19)[0x7f17f0d45b19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 6] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a8)[0x7f17f0d46488]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10613)[0x7f17f0aac613]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x2b1)[0x7f17f0aacb71]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 9] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x37)[0x7f17f0d46d17]
[1,0]<stderr>:[ip-172-31-6-189:00789] [10] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8ea19)[0x7f17f0d42a19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [11] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd8dc)[0x7f17f0d718dc]
[1,0]<stderr>:[ip-172-31-6-189:00789] [12] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xaa8)[0x7f17c47176b8]
[1,0]<stderr>:[ip-172-31-6-189:00789] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x43041)[0x7f17f5b39041]
[1,0]<stderr>:[ip-172-31-6-189:00789] [14] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x4313a)[0x7f17f5b3913a]
[1,0]<stderr>:[ip-172-31-6-189:00789] [15] /opt/amazon/efa/lib/libfabric.so.1(+0x5ebbf)[0x7f16dbd7ebbf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[16] /opt/amazon/efa/lib/libfabric.so.1(+0xf3f2)[0x7f16dbd2f3f2]
[1,0]<stderr>:[ip-172-31-6-189:00789] [17] /opt/amazon/efa/lib/libfabric.so.1(fi_getinfo+0x45d)[0x7f16dbd2fa9d]
[1,0]<stderr>:[ip-172-31-6-189:00789] [18] /usr/local/lib/libnccl-net.so(+0x2045)[0x7f16e41cc045]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[19] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1065)[0x7f17c47a2065]
[1,0]<stderr>:[ip-172-31-6-189:00789] [20] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1b4f)[0x7f17c47a2b4f]
[1,0]<stderr>:[ip-172-31-6-189:00789] [21] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x245)[0x7f17c47633a5]
[1,0]<stderr>:[ip-172-31-6-189:00789] [22] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x54)[0x7f17c47634d4]
[1,0]<stderr>:[ip-172-31-6-189:00789] [23] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7f17c472e061]
[1,0]<stderr>:[ip-172-31-6-189:00789] [24] /usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0xa1)[0x7f17c472e371]
[1,0]<stderr>:[ip-172-31-6-189:00789] [25] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x5e440)[0x7f17c470f440]
[1,0]<stderr>:[ip-172-31-6-189:00789] [26] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0xaecf)[0x7f17f2000ecf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [27] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f17f58de6db]
[1,0]<stderr>:[ip-172-31-6-189:00789] [28] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f17f5c1788f]
[1,0]<stderr>:[ip-172-31-6-189:00789] *** End of error message ***
Complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_horovod-test.log
The NCCL all_reduce_perf test also works:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_nccl-efa-test.log
And the output of fi_info -p efa and cat /opt/amazon/efa_installed_packages:
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-rdm
version: 2.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 2.0
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::4a:78ff:fef8:fa03
domain: efa_0-dgrm
version: 1.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
SHM transfer will be disabled because of ptrace protection.
To enable SHM transfer, please refer to the man page fi_efa.7 for more information.
Also note that turning off ptrace protection has security implications. If you cannot
turn it off, you can suppress this message by setting FI_EFA_ENABLE_SHM_TRANSFER=0
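As an aside, the ptrace warning in that output can be silenced the way the message itself suggests — explicitly disabling SHM transfer when ptrace protection cannot be turned off:

```shell
# As the libfabric fi_efa warning above suggests: explicitly disable SHM
# transfer instead of turning off ptrace protection.
export FI_EFA_ENABLE_SHM_TRANSFER=0
echo "FI_EFA_ENABLE_SHM_TRANSFER=$FI_EFA_ENABLE_SHM_TRANSFER"
```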
EFA installer version: 1.8.3
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-aws-bin_1.9.0amzn1.1_amd64 libfabric-aws-dev_1.9.0amzn1.1_amd64 libfabric1-aws_1.9.0amzn1.1_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi40-aws_4.0.2-1_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-aws-dbg_1.9.0amzn1.1_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64
Could someone please give me some suggestions about the right direction? Thank you very much!
Hey AWS team,
At Cornelis Networks we have had good luck so far with the plugin. We are able to run all of NVIDIA's nccl performance tests with the plugin and our OPX libfabric provider!
We want to start running some real pytorch/tensorflow workloads and assess performance for some 'real-world' applications. I was hoping you'd be able to point me towards some apps/workloads that you folks use for performance benchmarking :) I noticed in #240 that someone mentioned the 'PyTorch-FSDP' workload, more examples similar to that would be greatly appreciated.
Thanks again for accepting our patches! Also, if there is a more appropriate forum for general questions like this (email, slack, etc), please let me know.
Looking at the docs and having run some test jobs on p3dn instances, it looks like GPU RDMA over EFA is not available on p3dn instances, only on p4d. Is this correct?
Some observations on p3dn (running a fairseq - pytorch - training job using the nccl backend):
NCCL_ALGO=tree is faster than NCCL_ALGO=ring (assuming no RDMA on p3dn, I'm thinking this is due to two copies: DtoH -> HtoH (inter-node, over EFA) -> HtoD).
With NCCL_DEBUG=INFO we're observing debug logs like:
[0]:ip-10-0-0-157:24:146 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
# -- NOT --
[0]:ip-10-0-0-157:24:146 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0/GDRDMA
nv_peer_mem is not loaded:
root@ip-10-0-0-159:/# lsmod | grep nv
nvidia_drm 61440 0
nvidia_modeset 1196032 1 nvidia_drm
nvidia_uvm 1138688 16
nvidia 35262464 2858 nvidia_uvm,nvidia_modeset
drm_kms_helper 184320 1 nvidia_drm
drm 425984 4 drm_kms_helper,nvidia,nvidia_drm
i2c_core 77824 3 drm_kms_helper,nvidia,drm
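The GDRDMA suffix check above can be scripted. This sketch uses one of the reported log lines as sample input; in practice you would pipe the job's NCCL_DEBUG=INFO stderr through the same grep:

```shell
# Sample NCCL INFO line copied from the report above; a GDRDMA suffix on the
# "via NET/AWS Libfabric" channel lines indicates the GPUDirect RDMA path.
log='ip-10-0-0-157:24:146 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0'
if printf '%s\n' "$log" | grep -q 'GDRDMA'; then
  echo "GPUDirect RDMA path in use"
else
  echo "host-memory path (no GDRDMA suffix)"
fi
```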
Hi,
Apologies in advance if this is not the right place to ask.
I am trying to run PyTorch DDP with NCCL backend on SageMaker. I have my own Docker image which uses the following as a base
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py38-cu117-ubuntu20.04-sagemaker
This is configured as a SageMaker Training job through the AWS Console (I am working on someone else's setup, so no full control/understanding of the underlying details).
The instance is ml.p3.16xlarge with 8 V100 GPUs. I don't need node-to-node communication, just communication between the GPUs on a single node. I have no issues running my code on an EC2 instance (ml.p3.16xlarge, but not using the docker image directly).
Running my job, I see the following warnings in the logs, and then the job seems to hang.
algo-1:249:249 [0] ofi_init:1288 NCCL WARN NET/OFI Only EFA provider is supported
algo-1:249:249 [0] ofi_init:1339 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
Could someone help me figure out the source of the issue here? I am not too familiar with EFA, but digging through the documentation here I see it's only supported on p3dn.24xlarge and p4d.24xlarge instances, which is not what I need. Is this a configuration issue with the container? Why is only the EFA provider supported through NCCL?
Any pointers would be really appreciated.
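One first diagnostic (hedged — fi_info may not even be present in the container) is to check whether any EFA provider is visible at all; on a non-EFA instance type such as ml.p3.16xlarge the plugin is expected to find none:

```shell
# Check whether libfabric sees an EFA provider inside the container.
# On a non-EFA instance this (or a missing fi_info binary) is expected to
# fail, and NCCL would need to fall back to its built-in transports.
msg=$(fi_info -p efa 2>&1 || echo "no EFA provider available")
echo "$msg"
```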
Hi there,
I am running nccl-tests on two p4d instances, and I noticed the performance drops significantly at a 64 MB buffer size, as shown in the following log:
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-13-103:7699:7699 [0] NCCL INFO Launch mode Parallel
8388608 2097152 float sum 1321.5 6.35 11.90 5e-07 1202.2 6.98 13.08 5e-07
16777216 4194304 float sum 1896.7 8.85 16.59 5e-07 1872.0 8.96 16.80 5e-07
33554432 8388608 float sum 1931.6 17.37 32.57 5e-07 1963.7 17.09 32.04 5e-07
67108864 16777216 float sum 6586.3 10.19 19.10 5e-07 6591.5 10.18 19.09 5e-07
134217728 33554432 float sum 8189.9 16.39 30.73 5e-07 8157.7 16.45 30.85 5e-07
268435456 67108864 float sum 11812 22.72 42.61 5e-07 11919 22.52 42.23 5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 25.6327
#
So I used a fixed step size of 2 MB and experimented with buffer sizes from 32 MB to 100 MB.
As shown in the log, the bus bandwidth drops to about 19 GB/s starting from 46 MB (48234496).
This is interesting. Any idea why this happens?
#
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
ip-172-31-13-103:7852:7852 [0] NCCL INFO Launch mode Parallel
ip-172-31-13-103:7862:7914 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:7862:7914 [7] NCCL INFO comm 0x7f8aec000dc0 rank 7 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
33554432 8388608 float sum 1926.3 17.42 32.66 5e-07 1887.0 17.78 33.34 5e-07
35651584 8912896 float sum 2040.2 17.47 32.76 5e-07 1984.1 17.97 33.69 5e-07
37748736 9437184 float sum 2203.1 17.13 32.13 5e-07 2059.0 18.33 34.38 5e-07
39845888 9961472 float sum 2237.6 17.81 33.39 5e-07 2222.5 17.93 33.62 5e-07
41943040 10485760 float sum 2339.2 17.93 33.62 5e-07 2327.6 18.02 33.79 5e-07
44040192 11010048 float sum 2408.5 18.28 34.28 5e-07 2420.2 18.20 34.12 5e-07
46137344 11534336 float sum 2521.8 18.30 34.30 5e-07 2515.4 18.34 34.39 5e-07
48234496 12058624 float sum 4953.8 9.74 18.26 5e-07 4861.2 9.92 18.60 5e-07
50331648 12582912 float sum 5067.2 9.93 18.62 5e-07 5134.0 9.80 18.38 5e-07
52428800 13107200 float sum 5293.5 9.90 18.57 5e-07 5254.6 9.98 18.71 5e-07
54525952 13631488 float sum 5424.3 10.05 18.85 5e-07 5457.3 9.99 18.73 5e-07
56623104 14155776 float sum 5651.5 10.02 18.79 5e-07 5613.9 10.09 18.91 5e-07
58720256 14680064 float sum 5805.5 10.11 18.96 5e-07 5835.1 10.06 18.87 5e-07
60817408 15204352 float sum 6018.0 10.11 18.95 5e-07 6011.0 10.12 18.97 5e-07
62914560 15728640 float sum 6218.5 10.12 18.97 5e-07 6259.8 10.05 18.84 5e-07
65011712 16252928 float sum 6467.3 10.05 18.85 5e-07 6437.6 10.10 18.94 5e-07
67108864 16777216 float sum 6579.8 10.20 19.12 5e-07 6583.1 10.19 19.11 5e-07
69206016 17301504 float sum 6817.4 10.15 19.03 5e-07 6832.4 10.13 18.99 5e-07
71303168 17825792 float sum 7018.6 10.16 19.05 5e-07 7090.8 10.06 18.85 5e-07
73400320 18350080 float sum 7231.0 10.15 19.03 5e-07 7199.5 10.20 19.12 5e-07
75497472 18874368 float sum 7402.5 10.20 19.12 5e-07 7369.1 10.25 19.21 5e-07
77594624 19398656 float sum 7564.9 10.26 19.23 5e-07 7607.6 10.20 19.12 5e-07
79691776 19922944 float sum 7791.7 10.23 19.18 5e-07 7788.8 10.23 19.18 5e-07
81788928 20447232 float sum 7958.2 10.28 19.27 5e-07 7949.7 10.29 19.29 5e-07
83886080 20971520 float sum 8144.1 10.30 19.31 5e-07 8171.6 10.27 19.25 5e-07
85983232 21495808 float sum 8342.2 10.31 19.33 5e-07 8385.3 10.25 19.23 5e-07
88080384 22020096 float sum 8601.3 10.24 19.20 5e-07 8600.0 10.24 19.20 5e-07
90177536 22544384 float sum 8814.6 10.23 19.18 5e-07 8802.5 10.24 19.21 5e-07
92274688 23068672 float sum 8982.6 10.27 19.26 5e-07 9002.8 10.25 19.22 5e-07
94371840 23592960 float sum 9199.3 10.26 19.23 5e-07 9166.0 10.30 19.30 5e-07
96468992 24117248 float sum 9360.9 10.31 19.32 5e-07 9420.4 10.24 19.20 5e-07
98566144 24641536 float sum 9589.1 10.28 19.27 5e-07 9586.6 10.28 19.28 5e-07
100663296 25165824 float sum 9769.6 10.30 19.32 5e-07 9762.1 10.31 19.33 5e-07
102760448 25690112 float sum 9997.1 10.28 19.27 5e-07 9980.3 10.30 19.31 5e-07
104857600 26214400 float sum 10162 10.32 19.35 5e-07 10208 10.27 19.26 5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth : 21.9716
#
Hi team,
The README documents the environment variable OFI_NCCL_GDR_FLUSH_DISABLE.
Is it safe to set it to 1, i.e., to disable the GDR flush for this plugin? Would setting it to 1 improve performance?
Thank you!
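Since this is a one-line environment change, a low-risk approach is to benchmark the same nccl-tests run with and without it. My understanding (an assumption, not from the README) is that the flush exists for data-visibility correctness after GPUDirect RDMA writes, so the error columns should be compared as well as the bandwidth numbers:

```shell
# Toggle the plugin's GDR flush for one run, then compare against a run
# without it; verify results, not just busbw.
export OFI_NCCL_GDR_FLUSH_DISABLE=1
echo "OFI_NCCL_GDR_FLUSH_DISABLE=$OFI_NCCL_GDR_FLUSH_DISABLE"
```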
I have this error and want to know how to solve it.
$ mpirun -n 2 --host node13,node14 ./nccl_message_transfer
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1033: NET/OFI Selected Provider is psm2
INFO: Function: main Line: 69: NET/OFI Process rank 0 started. NCCLNet device used on node13 is AWS Libfabric.
INFO: Function: main Line: 73: NET/OFI Received 1 network devices
INFO: Function: ofi_pciPath Line: 1094: NET/OFI No NIC info for dev 0
INFO: Function: ofi_getProperties Line: 1194: NET/OFI No NIC info for dev 0. Supplying default values for NIC properties.
TRACE: Function: print_dev_props Line: 78: NET/OFI ****************** Device 0 Properties ******************
TRACE: Function: print_dev_props Line: 79: NET/OFI hfi1_0;hfi1_1: PCIe Path: (null)
TRACE: Function: print_dev_props Line: 80: NET/OFI hfi1_0;hfi1_1: Plugin Support: 1
TRACE: Function: print_dev_props Line: 81: NET/OFI hfi1_0;hfi1_1: Device GUID: 0
TRACE: Function: print_dev_props Line: 82: NET/OFI hfi1_0;hfi1_1: Device Speed: 0
TRACE: Function: print_dev_props Line: 83: NET/OFI hfi1_0;hfi1_1: Device Port: 1
TRACE: Function: print_dev_props Line: 84: NET/OFI hfi1_0;hfi1_1: Device Maximum Communicators: 65535
TRACE: Function: main Line: 104: NET/OFI Rank 0 uses 0 device for communication
INFO: Function: main Line: 114: NET/OFI Server: Listening on dev 0
WARN: Function: create_nccl_ofi_component Line: 708: NET/OFI Couldn't allocate endpoint. RC: -22, ERROR: Invalid argument
WARN: Function: main Line: 115: NET/OFI OFI NCCL failure: 2
$ fi_info -p psm2
provider: psm2
fabric: psm2
domain: psm2
version: 1.6
type: FI_EP_RDM
protocol: FI_PROTO_PSMX2
provider: psm2
fabric: psm2
domain: psm2
version: 1.6
type: FI_EP_RDM
protocol: FI_PROTO_PSMX2
provider: psm2
fabric: psm2
domain: psm2
version: 1.6
type: FI_EP_RDM
protocol: FI_PROTO_PSMX2
Hello,
Using a fresh deployment of an Ubuntu 22.04 AMI on p4d.24xlarge instances, and installing the CUDA stack bare-metal followed by EFA using these commands:
curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
sudo ./aws-efa-installer/efa_installer.sh -y
Then running a container image that was built with those commands in its Dockerfile:
...
RUN cd /root \
&& curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
&& tar -xf /root/aws-efa-installer-latest.tar.gz \
&& cd aws-efa-installer \
&& apt-get update \
&& apt-get install -y libhwloc-dev \
&& ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
&& apt-get install -y libfabric-bin \
&& rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
&& cd /opt/aws-ofi-nccl \
&& git checkout v1.7.1-aws \
&& ./autogen.sh \
&& ./configure --prefix=/opt/aws-ofi-nccl/ \
--with-libfabric=/opt/amazon/efa/ \
--with-cuda=/usr/local/cuda \
&& make && make install
...
We get this error running the NCCL tests between 2 containers (one on each instance):
configure_sendrecv_inorder:213 NCCL WARN NET/OFI Couldn't set FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES. RC: -92, ERROR: Protocol not available
nccl_net_ofi_init:1163 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
Running fi_info -p efa -t FI_EP_RDM we get:
provider: efa
fabric: EFA-fe80::8d3:d9ff:fe9a:6059
domain: rdmap197s0-rdm
version: 111.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
fabric: EFA-fe80::8d3:d9ff:fe9a:6059
domain: rdmap197s0-dgrm
version: 111.0
type: FI_EP_RDM
protocol: FI_PROTO_RXD
^^ We see 2 provider sections per EFA adapter (this example is for 1 adapter per instance).
WORKAROUND:
To solve this issue, we need to reinstall EFA on the containers with:
./efa_installer.sh -y -k --uninstall
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify
Then the NCCL tests work fine with EFA, and for some reason fi_info -p efa -t FI_EP_RDM returns a single provider section per EFA adapter:
provider: efa
fabric: EFA-fe80::8d3:d9ff:fe9a:6059
domain: rdmap197s0-rdm
version: 111.0
type: FI_EP_RDM
protocol: FI_PROTO_EFA
Do you know why 2 libfabric provider sections show up in the broken scenario and why we need to re-install EFA on the containers after running them?
Currently on the aws branch, the OFI plugin uses the --prefix location given at configure time to store the XML topology files; XML_DIR is set to the prefix value at compile time: https://github.com/aws/aws-ofi-nccl/blob/aws/src/Makefile.am#L8
At runtime it sets NCCL_TOPO_FILE according to the XML_DIR value: https://github.com/aws/aws-ofi-nccl/blob/aws/src/nccl_ofi_net.c#L1178
This creates a problem when I build the OFI plugin library in one place (a CI system), package it, and try to run it on another machine (e.g. via conda install to a $CONDA_PREFIX).
If the build path (--prefix) and $CONDA_PREFIX do not match, the plugin fails to find the topology file, resulting in a regression.
Could you update the logic to honor a preset NCCL_TOPO_FILE environment variable? This would allow me to configure the plugin to read the XML files from any location. Please also add logging of which topology file is used.
I'm able to work around this issue using Conda's binary-relocation patch, which rewrites the $PREFIX value during installation, but this would still be a good feature to have.
Thanks!
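The requested precedence can be sketched in shell form — honor a preset NCCL_TOPO_FILE and only fall back to a default location (the fallback path below is illustrative, not the plugin's real XML_DIR):

```shell
# Prefer a user-set NCCL_TOPO_FILE; otherwise fall back to a prefix-relative
# default. Both the $CONDA_PREFIX layout and the file name are assumptions
# for illustration.
default_topo="${CONDA_PREFIX:-/usr/local}/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml"
export NCCL_TOPO_FILE="${NCCL_TOPO_FILE:-$default_topo}"
echo "topology file in use: $NCCL_TOPO_FILE"
```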
I have a docker image with EFA and aws-ofi-nccl installed. This image "works" with EFA when running on an AL2 AMI (but is slow, see #106). However, when the same image is run on an Ubuntu AMI, we get an error message:
[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993
Instance type: p3dn.24xlarge (EFA enabled)
AMI: ami-061dac75dbd529aef (in us-west-2)
Nvidia driver version: 510.47.03
CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)
Output of /opt/amazon/efa/bin/fi_info -p efa:
provider: efa
fabric: EFA-fe80::da:b9ff:fe04:8af
domain: rdmap0s6-rdm
version: 114.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::da:b9ff:fe04:8af
domain: rdmap0s6-dgrm
version: 114.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
nvidia-docker version: 20.10.14
Training command (executed on both training nodes):
nvidia-docker run \
--mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
--network host \
--device /dev/infiniband/uverbs0 \
--env FI_PROVIDER=EFA \
--env NCCL_SOCKET_IFNAME=ens5 \
--env LOGLEVEL=INFO \
--env NCCL_PROTO=simple \
--env NCCL_DEBUG=INFO \
919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
--master_addr=$MASTER_IP --master_port=12345 \
fairseq_train_wrapped \
--task language_modeling \
/job/fairseq/data-bin/wikitext-103 \
--save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu" \
--arch transformer_lm --share-decoder-input-output-embed \
--dropout 0.1 \
--optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
--lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
--tokens-per-sample 512 --sample-break-mode none \
--max-tokens 2048 \
--max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt
Full log: https://gist.github.com/yukunlin/dd5e5ee6e41f84a696e76b74e75c65d0
This seems related to #44. I did follow #44 (comment) and ran nccl-test on the AMI successfully, which indicates EFA is working (on the AMI at least).
Note that the docker image works (it doesn't crash) when running from an AL2 AMI (see #106)
Our AWS p4d.24xlarge job passed on 08/24, and the throughput was 3511 samples/second.
We used two p4d.24xlarges with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1.
This test failed recently. The error message follows:
File "/workspace/bert/run_pretraining.py", line 1592, in <module>
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1344, in main
for step, batch in enumerate(train_dataloader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in __iter__
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in __init__
self._reset(loader, first_iter=True)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
self._try_put_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
index = self._next_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in __iter__
for idx in self.sampler:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in __iter__
yield from torch.randperm(n, generator=generator).tolist()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.
When tested without FI_EFA_USE_DEVICE_RDMA=1, the test passes, but the throughput is only 1673 samples/sec.
This is the Dockerfile we used:
https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base
Hi All,
I am running PyTorch distributed training (code here) on 4 AWS A100 nodes with EFA. We got an error of
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
when launching the experiments. Using Ethernet gives correct results, so we think the issue happens in EFA. Could you please take a look at it?
Here are the library versions:
PyTorch 1.12.0
CUDA: 11.6
NCCL: 2.10.3
aws-ofi-nccl: 1.3.0
libfabric: libfabric.so.1.19.0
cuda-drivers-fabricmanager-510
cuda-drivers-510
Full log before crashes:
ip-10-216-179-193:964326:964326 [0] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964326:964326 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.6
ip-10-216-179-193:964328:964328 [2] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964332:964332 [6] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964331:964331 [5] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964327:964327 [1] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964330:964330 [4] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964329:964329 [3] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964333:964333 [7] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964328:964328 [2] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964331:964331 [5] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964330:964330 [4] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964332:964332 [6] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964327:964327 [1] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964329:964329 [3] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964333:964333 [7] NCCL INFO Using network AWS Libfabric
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Traceback (most recent call last):
File "main.py", line 177, in <module>
devjobs_multi_node_main(args)
File "main.py", line 77, in devjobs_multi_node_main
mp.spawn(main, nprocs=args.gpus, args=(args,))
File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 140, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT
Let me know if any other information or some other tests can help. Thanks!
Best,
Hao
When running nccl-tests, I see the following error:
nccl-tests-worker-0:39:45 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
nccl-tests-worker-0:39:45 [0] nccl_net_ofi_cuda_init:39 NCCL WARN NET/OFI Failed to find CUDA Runtime library: libcudart.so: cannot open shared object file: No such file or directory
The base image nvidia/cuda:12-runtime-ubuntu:22.04 only includes the file /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12, but not libcudart.so. Adding the symlink as below works around the issue, but it would be good if it worked out of the box, by looking for libcudart.so.*, if this is possible?
RUN ln -s /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12 /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
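A slightly more general form of that workaround, which picks up whatever versioned runtime the image ships (the CUDA path is taken from the report above; adjust for other CUDA versions):

```shell
# Symlink the newest versioned libcudart to the unversioned name the plugin
# dlopen()s. No-op with a message if no versioned runtime is present.
libdir=/usr/local/cuda-12.0/targets/x86_64-linux/lib
versioned=$(ls "$libdir"/libcudart.so.* 2>/dev/null | head -n 1)
if [ -n "$versioned" ]; then
  ln -sf "$versioned" "$libdir/libcudart.so"
else
  echo "no versioned libcudart found under $libdir"
fi
```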
I've installed the EFA driver successfully on an AWS EC2 Deep Learning AMI. I've also built NCCL 2.4.2. When building aws-ofi-nccl, I get the following error during configure:
[ec2-user@ip-10-0-48-207 aws-ofi-nccl]$ ./configure --with-cuda=/usr/local/cuda/ --with-nccl=$NCCL_HOME --with-mpi=/opt/amazon/efa/bin/
checking whether make supports nested variables... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether make sets $(MAKE)... yes
checking for ar... ar
checking the archiver (ar) interface... ar
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking for inline... inline
checking for size_t... yes
checking for ssize_t... yes
checking for uint64_t... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for memset... yes
checking for realpath... yes
checking limits.h usability... yes
checking limits.h presence... yes
checking for limits.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for unistd.h... (cached) yes
checking rdma/fabric.h usability... no
checking rdma/fabric.h presence... no
checking for rdma/fabric.h... no
configure: error: unable to find required headers
[ec2-user@ip-10-0-48-207 aws-ofi-nccl]$ make -j 32 NCCL_HOME=$NCCL_HOME
make: *** No targets specified and no makefile found. Stop.
I have libfabric source, but if I make and install it, it overwrites EFA libfabric. Any ideas?
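One likely approach (paths below are the EFA installer's defaults; verify them on your AMI): install a from-source libfabric into its own prefix so it never overwrites the EFA copy, and point the plugin's configure at whichever libfabric you want via --with-libfabric, which also supplies the missing rdma/fabric.h headers:

```shell
# Build from-source libfabric into a private prefix (example path)
cd libfabric
./autogen.sh
./configure --prefix="$HOME/libfabric-install"
make -j && make install

# Point aws-ofi-nccl at the EFA-installed libfabric instead
cd ../aws-ofi-nccl
./configure --with-libfabric=/opt/amazon/efa \
            --with-cuda=/usr/local/cuda \
            --with-nccl="$NCCL_HOME" \
            --with-mpi=/opt/amazon/efa/bin/
```

The key point is that configure failed because no libfabric headers were on the search path; --with-libfabric fixes that without touching the system install.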
I ran into a bit of a headache because I wasn't referring to the nccl-tests documentation when attempting to build the performance tests.
All I needed was to add MPI=1 MPI_HOME=/path/to/mpi and things worked well :)
Do you folks think it's reasonable to add this to the README.md? If so, I'll open a PR to close this out.
Something like:
2. Build the tests
cd nccl-tests/
make NCCL_HOME=~/nccl/build MPI=1 MPI_HOME=$MPI_HOME
Running nccl-tests on the image public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3 results in the following errors.
[1,13]<stdout>:nccl-tests-worker-1:35:89 [5] ofi_process_cq:1039 NCCL WARN NET/OFI Request 0x7fbb944701b8 completed with error. RC: 20. Error: unknown error. Completed length: 0, Request: { buffer_index: 255, dev: 1, size: 0, state: CREATED, direction: SEND } [1,13]<stderr>:libfabric:35:1668017255::efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 20, prov_err: Not a directory (20)
Full error log: https://gist.github.com/Csinclair0/48ee4f1388d4901e6958069ee272a305
MPIJob spec : https://gist.github.com/Csinclair0/18dcc46fb3ae98c189b9c5dcd56ead2f
Hello aws_ofi_nccl maintainers,
NCCL, during its sequence of GDR capability checking, calls ofi_listen, ofi_connect and ofi_accept sequentially.
ofi_connect issues a non-blocking fi_tsend to send a small buffer of data to the recipient.
Later it tries to ensure that the data was actually transferred.
The underlying provider detects that the caller wants to talk to itself.
If the provider implements a rendezvous mode for self-communication, the transfer is considered complete only once the caller has actually received the buffer, i.e. ofi_accept must be called before ofi_connect. At the same time, accepting is considered complete only when the data is already there at the moment accept is called. Deadlock.
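The ordering can be sketched with a toy rendezvous model (invented names; neither the plugin's nor libfabric's API): the self-send completes only after the receiver has posted its buffer, so posting the receive (the accept step) first is the only order that makes progress.

```c
#include <pthread.h>
#include <stdbool.h>

/* Toy rendezvous self-transfer: the send "completes" only after the
 * receiver has posted a buffer. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
bool recv_posted = false;

void rendezvous_accept(void)          /* receiver posts its buffer */
{
    pthread_mutex_lock(&lock);
    recv_posted = true;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}

void rendezvous_send_wait(void)       /* sender waits for send completion */
{
    pthread_mutex_lock(&lock);
    while (!recv_posted)              /* blocks forever if accept never ran */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
}
```

If the caller invokes rendezvous_send_wait() before rendezvous_accept() from the same thread, it blocks forever, which is exactly the connect-before-accept deadlock described above.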
To fix this, pull request #84 was created.
Please consider merging it.
BRs,
Denis
In aws-ofi-nccl/src/nccl_ofi_freelist.c (line 156 at commit 928ce18), allocation_count is unsigned, so it can't be < 0.
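The pitfall, reduced to two lines (a generic illustration, not the file's actual code): comparing an unsigned value with < 0 is always false, so the error branch is dead code; compilers typically flag this with -Wtype-limits.

```c
#include <stddef.h>

/* A size_t (or any unsigned type) can never be negative, so this
 * check always evaluates to 0 and the error path is unreachable. */
int never_negative(size_t allocation_count)
{
    return allocation_count < 0;   /* always 0: dead error check */
}
```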
Moving from NVIDIA/nccl#235
Summary of findings.
TL;DR: with 8 machines, throughput is 70% of on-prem InfiniBand; with 16 machines, 60%.
The fi_tsend() call in ofi_connect() uses a local stack variable (local_ep_addr) as the send buffer. This buffer must remain valid until the completion for this send has been reaped; however, the function exits without ensuring this. This could result in a corrupted connect message if the memory is reused and overwritten before the provider has sent the data.
Either an inject-style send should be used, or the local_ep_addr needs to live on the heap somewhere.
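The hazard can be simulated without libfabric (fake_async_send and friends are made-up stand-ins): an asynchronous send that only records the caller's pointer reads garbage if that pointer was a stack variable that has since been reused. Copying the data to the heap before handing it to the transport (which is also what an inject-style send does internally, before returning) avoids it.

```c
#include <stdlib.h>
#include <string.h>

/* Fake async transport: records the pointer now, reads it "later". */
static const void *pending_buf;
static size_t pending_len;

void fake_async_send(const void *buf, size_t len)
{
    pending_buf = buf;        /* no copy: caller must keep buf alive */
    pending_len = len;
}

size_t fake_complete(void *out)
{
    memcpy(out, pending_buf, pending_len);
    return pending_len;
}

/* Safe pattern: copy the endpoint address to the heap so it outlives
 * the calling function; free it only after the completion is reaped. */
void *send_ep_addr(const char *addr, size_t len)
{
    void *copy = malloc(len);
    memcpy(copy, addr, len);
    fake_async_send(copy, len);
    return copy;              /* free() after fake_complete() */
}
```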
[ec2-user@ip-172-31-23-78 aws-ofi-nccl-0.9]$ make
mpicc -I/opt/nccl/include -I/usr/local/cuda/include -I/opt/libfabric/include -I./include/ -o /home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring -L/home/ec2-user/oss/aws-ofi-nccl-0.9 -lnccl-net \
-L/opt/libfabric/lib -lfabric -L/usr/local/cuda/lib64 -lcudart \
/home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring.c -ldl
In file included from /home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring.c:5:0:
/home/ec2-user/oss/aws-ofi-nccl-0.9/tests/test-common.h:11:10: fatal error: nccl_net.h: No such file or directory
#include <nccl_net.h>
^~~~~~~~~~~~
compilation terminated.
make: *** [/home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring] Error 1
I am attempting to build off an existing Dockerfile to add EFA multi-node support for some runs I would like to do on AWS. My Dockerfile can be found here. I am looking at this guide and this example Dockerfile.
I would like to avoid rebuilding pytorch from source, so I am instead trying to clone the same version of nccl and then build the aws-ofi-nccl plugin pointing to that version of nccl. While doing this, I am hitting the following error:
configure: WARNING: unrecognized options: --with-nccl
on the following command:
./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-nccl=/opt/nccl/build --with-mpi=/opt/amazon/openmpi/
It appears the latest on the aws branch no longer has this flag. Are there updated instructions on how to install this?
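If I understand the warning correctly, configure simply no longer recognizes --with-nccl (newer versions of the plugin carry the NCCL net-plugin headers themselves), so dropping that one flag from the otherwise identical command may be all that's needed; worth confirming against the README of the branch you're building:

```shell
# Same command, minus the flag configure no longer recognizes
./configure --prefix=/opt/aws-ofi-nccl/install \
            --with-libfabric=/opt/amazon/efa/ \
            --with-cuda=/usr/local/cuda \
            --with-mpi=/opt/amazon/openmpi/
```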
Hello aws-ofi-nccl maintainers,
If the underlying provider (PSM3 in our case) supports multi-rail, it typically exposes a virtual device with a NULL name. The plugin blindly strdups this name, and in the NULL case a segfault occurs.
#101 fixes this.
Could you please consider merging it?
BRs,
Denis.
Hi Team,
Step 8 of this EFA guidance seems to recommend the following NCCL settings:
NCCL_ALGO=ring: enables the ring algorithm for collective operations.
NCCL_PROTO=simple: instructs NCCL to use the simple protocol for communication. Currently, the EFA provider does not support LL protocols; enabling them could lead to data corruption.
Are they still the recommended settings today? In this recent issue, the recommendation seems to have changed to NOT changing any NCCL defaults: #65 (comment)
Any clarification would be much appreciated!
Specifically, does EFA support LL or LL128 today? Or do users still need to set the protocol to simple to avoid data corruption?
Thank you!
I tried EFA 1.7.0 early this year with the NCCL-tests all_reduce test, and the max out-of-place algbw was 104 Gbps.
With EFA 1.7.1, the max out-of-place algbw is 65 Gbps.
Could you help me check this?
On 2 P4DN nodes, I upgraded NCCL to v2.10.3 and PyTorch to v1.10. However, EFA is not enabled, even though all related libraries (PyTorch, aws-ofi-nccl, nccl-tests) load the same NCCL library (located at /usr/lib/x86_64-linux-gnu).
The error logs from aws-ofi-nccl/tests and nccl_tests are as follows.
/opt/amazon/openmpi/bin/mpirun \
-n 16 -N 8 --hostfile /job/hostfile \
-x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/lib:/deepspeed/$USER/aws-ofi-nccl/install/lib:/home/$USER/aws-ofi-nccl:$LD_LIBRARY_PATH \
-x FI_PROVIDER="efa" --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/home/deepspeed/aws-ofi-nccl/tests/nccl_message_transfer
10.3.35.83: + /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 --hostfile /job/hostfile -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/lib:/deepspeed/deepspeed/aws-ofi-nccl/install/lib:/home/deepspeed/aws-ofi-nccl:/usr/local/cuda-11.0/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/lib:/usr/local/lib:/usr/local/cuda-11.0/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/lib:/usr/local/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64 -x FI_PROVIDER=efa --mca btl_tcp_if_exclude lo,docker0 --bind-to none /home/deepspeed/aws-ofi-nccl/tests/nccl_message_transfer
10.3.35.83: Warning: Permanently added '[10.3.60.130]:2022' (RSA) to the list of known hosts.
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: Primary job terminated normally, but 1 process returned
10.3.35.83: a non-zero exit code. Per user-direction, the job has been aborted.
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: mpirun detected that one or more processes exited with non-zero status, thus causing
10.3.35.83: the job to be terminated. The first process to do so was:
10.3.35.83:
10.3.35.83: Process name: [[43564,1],7]
10.3.35.83: Exit code: 2
10.3.35.83: --------------------------------------------------------------------------
cd /fsx/hchaoyan/home/m5/nccl-tests && \
sudo rm -rf build && \
make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu
echo "running P4DN test"
cd /fsx/hchaoyan/home/m5/nccl-tests
$(which mpirun) -allow-run-as-root --mca plm_rsh_no_tree_spawn 1 \
-x FI_PROVIDER="efa" \
-x NCCL_SOCKET_IFNAME=eth \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x RDMAV_FORK_SAFE=1 \
-x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/$USER/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
-x NCCL_MIN_NCHANNELS=8 \
-x NCCL_ALGO=Ring \
-x OMP_NUM_THREADS=8 \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-bind-to none \
-n 16 -N 8 \
--mca pml ^cm \
--hostfile /job/hostfile \
-mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
./build/all_reduce_perf -b 2G -e 2G -g 1 -n 30
10.3.35.83: # Rank 15 Pid 42 on ip-10-3-60-130 device 7 [0xa0] A100-SXM4-40GB
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83:
10.3.35.83: ip-10-3-35-83:362:362 [0] find_ofi_provider:543 NCCL WARN NET/OFI Couldn't find any optimal provider
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/IB : No device found.
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/Socket : Using [0]eth0:10.3.35.83<0> [1]eth1:10.3.42.59<0> [2]eth2:10.3.48.121<0> [3]eth3:10.3.49.25<0>
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO Using network Socket
10.3.35.83: NCCL version 2.10.3+cuda11.0
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:365:365 [3] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
Operating System: Centos 7
GCC version: 4.8.5
Host: EC2 g4dn.12xlarge
aws
branch fails make
with error
make[2]: Entering directory `aws_ofi_nccl/source/aws_ofi_nccl/tests'
CC nccl_message_transfer.o
CC ring.o
CCLD nccl_connection
ring.c: In function ‘main’:
ring.c:42:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (int recv_n = 0; recv_n < nrecv; recv_n++) {
^
ring.c:42:2: note: use option -std=c99 or -std=gnu99 to compile your code
nccl_message_transfer.c: In function ‘main’:
nccl_message_transfer.c:41:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
for (int recv_n = 0; recv_n < nrecv; recv_n++) {
^
nccl_message_transfer.c:41:2: note: use option -std=c99 or -std=gnu99 to compile your code
make[2]: *** [ring.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [nccl_message_transfer.o] Error 1
make[2]: Leaving directory `aws_ofi_nccl/source/aws_ofi_nccl/tests'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `aws_ofi_nccl/source/aws_ofi_nccl'
make: *** [all] Error 2
./autogen.sh
./configure --prefix=${PWD}/install --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
make -j # <--- error
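The test sources use C99 for-loop declarations, but GCC 4.8 defaults to an older dialect. A likely workaround (untested on this exact setup) is to pass a C99/gnu99 dialect through CFLAGS at configure time; newer GCC versions default to gnu11 and don't hit this:

```shell
./autogen.sh
CFLAGS="-std=gnu99 -O2" ./configure --prefix=${PWD}/install \
    --with-libfabric=/opt/amazon/efa \
    --with-cuda=/usr/local/cuda \
    --with-mpi=/opt/amazon/openmpi
make -j
```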
On some hardware, even a simple TensorFlow test can end up calling ofi_iflush() tens of thousands of times per rank. This serves no benefit, since PSM3 ensures that GPU buffers are kept in sync after each I/O. In addition, because ofi_iflush() calls ofi_nccl_gdr_flush_disable() on every invocation, and ofi_nccl_gdr_flush_disable() acquires a mutex on each invocation, this adds further drag on performance.
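One way to shave that overhead (a sketch with invented names, not the plugin's actual code; the environment-variable name is an example): resolve the flag once with pthread_once and return the cached value on every later call, so the hot path does no locking at all.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

/* Resolve the knob exactly once; later calls are a plain load. */
static pthread_once_t flush_once = PTHREAD_ONCE_INIT;
static bool flush_disabled;

static void read_flush_env(void)
{
    const char *v = getenv("OFI_NCCL_GDR_FLUSH_DISABLE");  /* example name */
    flush_disabled = (v != NULL && atoi(v) != 0);
}

bool gdr_flush_disabled(void)
{
    pthread_once(&flush_once, read_flush_env);
    return flush_disabled;
}
```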
I noticed there are two branches of this repo. What are their differences? Which one should I be installing?
Hello aws_ofi_nccl maintainers,
Please let me know if this is not the best location to post the issue and I will close this issue.
I am unable to figure out why the process is hanging after the error message is shown.
My training setup:
2 ml.g4dn.12xlarge instances on AWS SageMaker trying to run distributed training with PyTorch base image 763104351884.dkr.ecr.us-west-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker.
The two instances are running inside a private subnet with a NAT gateway attached to the subnet.
All outputs are from host-1
Output of lspci -i efa:
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:07.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
00:08.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:09.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:1a.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:1b.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1c.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1d.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller
Output of cat /opt/amazon/efa_installed_packages:
# EFA installer version: 1.15.1
# Debug packages installed: no
# Packages installed:
efa-config_1.9_all efa-profile_1.5_all libfabric-aws-bin_1.14.0amzn1.0_amd64 libfabric-aws-dev_1.14.0amzn1.0_amd64 libfabric1-aws_1.14.0amzn1.0_amd64 openmpi40-aws_4.1.2-1_amd64 ibacm_39.0-1_amd64 ibverbs-providers_39.0-1_amd64 ibverbs-utils_39.0-1_amd64 infiniband-diags_39.0-1_amd64 libibmad-dev_39.0-1_amd64 libibmad5_39.0-1_amd64 libibnetdisc-dev_39.0-1_amd64 libibnetdisc5_39.0-1_amd64 libibumad-dev_39.0-1_amd64 libibumad3_39.0-1_amd64 libibverbs-dev_39.0-1_amd64 libibverbs1_39.0-1_amd64 librdmacm-dev_39.0-1_amd64 librdmacm1_39.0-1_amd64 rdma-core_39.0-1_amd64 rdmacm-utils_39.0-1_amd64
Output of /opt/amazon/efa/bin/fi_info -p efa:
provider: efa
fabric: EFA-fe80::424:a9ff:fed5:b935
domain: efa_0-rdm
version: 114.10
type: FI_EP_RDM
protocol: FI_PROTO_EFA
provider: efa
fabric: EFA-fe80::424:a9ff:fed5:b935
domain: efa_0-dgrm
version: 114.10
type: FI_EP_DGRAM
protocol: FI_PROTO_EFA
Output of training job:
Distributed training is initialized with the NCCL backend in PyTorch using the mmaction2 training library. I set FI_EFA_USE_DEVICE_RDMA=0 because the T4 GPUs do not support RDMA. Also, the cmd is run as an os.system() command in the entrypoint passed to SageMaker:
cmd=
NCCL_SOCKET_IFNAME=eth0 FI_PROVIDER="efa" FI_EFA_USE_DEVICE_RDMA=0 NCCL_DEBUG=INFO FI_LOG_LEVEL=warn FI_LOG_PROV=efa PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.22b20221214-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/flash_attn-0.1-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/einops-0.6.0-py3.8.egg python -m torch.distributed.launch --nnodes=2 --node_rank=0 --master_addr=algo-1 --nproc_per_node=4 --master_port=7777 <train script> <config.py>
algo-1:462:462 [0] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:462:462 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:462:462 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:462:462 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:462:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:462:462 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:462:462 [0] NCCL INFO NET/OFI Selected Provider is efa
algo-1:462:462 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
algo-1:463:463 [1] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:464:464 [2] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:465:465 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:464:464 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:463:463 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:464:464 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:463:463 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:465:465 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:464:464 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:463:463 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:465:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:465:465 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:465:465 [3] NCCL INFO NET/OFI Selected Provider is efa
algo-1:465:465 [3] NCCL INFO Using network AWS Libfabric
libfabric:463:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
libfabric:464:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:463:463 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:463:463 [1] NCCL INFO NET/OFI Selected Provider is efa
algo-1:463:463 [1] NCCL INFO Using network AWS Libfabric
algo-1:464:464 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:464:464 [2] NCCL INFO NET/OFI Selected Provider is efa
algo-1:464:464 [2] NCCL INFO Using network AWS Libfabric
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:463:557 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
algo-1:464:558 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
algo-1:465:556 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
algo-1:462:555 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4 5 6 7
algo-1:462:555 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4 5 6 7
algo-1:462:555 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
algo-1:462:555 [0] NCCL INFO Channel 00 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:465:556 [3] NCCL INFO Channel 00 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:465:556 [3] NCCL INFO Channel 01 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Connected all rings
algo-1:462:555 [0] NCCL INFO Channel 01 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:462:555 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:462:555 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Connected all rings
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 0[1b0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] via direct shared memory
libfabric:465:1673041637:efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 21, prov_err: Unknown error -21 (21)
algo-1:465:556 [3] ofi_process_cq:1033 NCCL WARN NET/OFI Request 0x7f6390394d18 completed with error. RC: 21. Error: unknown error. Completed length: 0, Request: { buffer_index: 255, dev: 0, size: 0, state: CREATED, direction: SEND }
I see the same error on the algo-2 instance as well.
Pytorch version and helper output by mmaction2:
2023-01-06 21:47:12,614 - mmaction - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201402
- Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.3
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
- CuDNN 8.3.2 (built against CUDA 11.5)
- Magma 2.5.2
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMAction2: 0.24.1+
------------------------------------------------------------
2023-01-06 21:47:12,614 - mmaction - INFO - Distributed training: True
Hi,
I'm wondering whether it's possible for this to utilize the Elastic Fabric Adapter.
Thanks