aws-ofi-nccl's Issues

Potential recurrence of https://github.com/aws/aws-ofi-nccl/issues/69

Hi there,

While benchmarking distributed training of Metaseq OPT across multiple p4d.24xlarge instances, we discovered an issue where the training processes launched by Slurm via the "opt-baselines" launcher run into "OSError: [Errno 12] Cannot allocate memory" in the PyTorch DataLoader.

Traceback (most recent call last):
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 793, in <module>
    cli_main()
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 789, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 289, in call_main
    return distributed_main(
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 227, in distributed_main
    retval = main(cfg, **kwargs)
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 339, in train
    samples = next(progress_iter)
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/logging/progress_bar/json_progress_bar.py", line 38, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 62, in __iter__
    for x in self.iterable:
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 851, in __next__
    raise item
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 782, in run
    for item in self._source:
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

After debugging, we found two ways to avoid the above error (a minimal sketch follows this list):
1. unset FI_EFA_USE_DEVICE_RDMA before launching training
2. lower --num-worker from the default of 8 to 0, 1, or 2
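
A minimal sketch of the two workarounds, assuming a shell wrapper around the "opt-baselines" launch (the launch command itself is a placeholder):

# Workaround 1: make sure FI_EFA_USE_DEVICE_RDMA is not set for the job.
unset FI_EFA_USE_DEVICE_RDMA

# Workaround 2: lower the DataLoader worker count from the default of 8.
opt-baselines ... --num-worker 2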

These workarounds make us believe it might be the same issue as #69.

System Info:

PyTorch: 1.13.1
NVIDIA Driver: 525.85.12
CUDA: 11.7
NCCL: 2.16.2 inc_nsteps
EFA Installer: 1.21.0
AWS OFI NCCL: 1.5.0-aws

Unable to write to EQ: Missing or unavailable event queue. err: Input/output error (5)

Running PyTorch through mpirun over an EFA network, I see the following error in the log:

Unable to write to EQ: Missing or unavailable event queue. err: Input/output error (5) prov_errno: Unknown error -114 (-114) prov/rxr/src/rxr.h:1042 

This is tied to EFA, since changing FI_PROVIDER to something other than EFA removes the error.

This error is also related to PyTorch DataLoaders using the fork start method to create additional processes under mpirun. Changing the start method to spawn allows training to proceed (see the sketch below). Unfortunately, there is a bug on the PyTorch side that limits the number of GPUs that can be used with spawn.
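
A minimal sketch of the spawn workaround, assuming a standard torch.utils.data setup (the dataset here is a placeholder):

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; any Dataset works the same way.
dataset = TensorDataset(torch.randn(1024, 8))

# Use "spawn" instead of the default "fork" for worker processes, so
# workers are not forked from a process that already initialized EFA/CUDA.
loader = DataLoader(
    dataset,
    num_workers=8,
    multiprocessing_context=mp.get_context("spawn"),
)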

Support Red Hat Enterprise Linux 9+

I am wondering whether you all have plans to test the plugin with RHEL 9, or whether you will eventually claim support for RHEL 9.

I've done some preliminary testing with the OPX libfabric provider on a RHEL 9 system with NVIDIA A40 GPUs and have had some success running the unit tests, the functional tests, and NVIDIA's nccl-tests.
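
For reference, selecting OPX for such a run comes down to the libfabric provider variable (a hedged sketch; the binary path and rank count are placeholders, "opx" is the provider name):

FI_PROVIDER=opx mpirun -n 2 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1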

Bad performance on two p4 instances (nccl-v2.7.8, aws-ofi-nccl-v1.1.2)

Hi there,
I have installed the plugin with nccl-v2.7.8 and CUDA 11, but running nccl-tests did not give reasonable results (I would expect to see bus bandwidth over 40 GB/s). One thing I noticed is that all NCCL channels go via NET/AWS Libfabric/0.
Here is the full log with NCCL debug output.

Any suggestions for configuring the plugin or NCCL?
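
For reference, a launch along these lines would produce the header below (a hedged reconstruction: the hostfile and binary path are assumptions; the -b/-e/-f/-g values are taken from the header itself):

# 16 ranks across two nodes, one GPU per rank, 8 B to 512 MB, factor 2.
mpirun -n 16 -N 8 --hostfile hosts \
    -x LD_LIBRARY_PATH -x NCCL_DEBUG=INFO \
    ./build/all_reduce_perf -b 8 -e 512M -f 2 -g 1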

# nThread 1 nGpus 1 minBytes 8 maxBytes 536870912 step: 2(factor) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   9738 on ip-172-31-13-103 device  0 [0x10] A100-SXM4-40GB
#   Rank  1 Pid   9739 on ip-172-31-13-103 device  1 [0x10] A100-SXM4-40GB
#   Rank  2 Pid   9740 on ip-172-31-13-103 device  2 [0x20] A100-SXM4-40GB
#   Rank  3 Pid   9741 on ip-172-31-13-103 device  3 [0x20] A100-SXM4-40GB
#   Rank  4 Pid   9742 on ip-172-31-13-103 device  4 [0x90] A100-SXM4-40GB
#   Rank  5 Pid   9743 on ip-172-31-13-103 device  5 [0x90] A100-SXM4-40GB
#   Rank  6 Pid   9744 on ip-172-31-13-103 device  6 [0xa0] A100-SXM4-40GB
#   Rank  7 Pid   9748 on ip-172-31-13-103 device  7 [0xa0] A100-SXM4-40GB
#   Rank  8 Pid   9921 on ip-172-31-6-104 device  0 [0x10] A100-SXM4-40GB
#   Rank  9 Pid   9922 on ip-172-31-6-104 device  1 [0x10] A100-SXM4-40GB
#   Rank 10 Pid   9923 on ip-172-31-6-104 device  2 [0x20] A100-SXM4-40GB
#   Rank 11 Pid   9924 on ip-172-31-6-104 device  3 [0x20] A100-SXM4-40GB
#   Rank 12 Pid   9925 on ip-172-31-6-104 device  4 [0x90] A100-SXM4-40GB
#   Rank 13 Pid   9926 on ip-172-31-6-104 device  5 [0x90] A100-SXM4-40GB
#   Rank 14 Pid   9927 on ip-172-31-6-104 device  6 [0xa0] A100-SXM4-40GB
#   Rank 15 Pid   9928 on ip-172-31-6-104 device  7 [0xa0] A100-SXM4-40GB
ip-172-31-13-103:9738:9738 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9738:9738 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9738:9738 [0] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9922:9922 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9923:9923 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9926:9926 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9924:9924 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9921:9921 [0] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9927:9927 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9925:9925 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9928:9928 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.6.104<0>
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9922:9922 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9922:9922 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9923:9923 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9923:9923 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9926:9926 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9926:9926 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9924:9924 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9924:9924 [3] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9928:9928 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9928:9928 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9925:9925 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9925:9925 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9927:9927 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9927:9927 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-6-104:9921:9921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-6-104:9921:9921 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.7.8+cuda11.0
ip-172-31-13-103:9744:9744 [6] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9744:9744 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9744:9744 [6] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9748:9748 [7] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9740:9740 [2] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9739:9739 [1] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9741:9741 [3] NCCL INFO Bootstrap : Using [0]ens32:172.31.13.103<0>
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9743:9743 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9743:9743 [5] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9742:9742 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9742:9742 [4] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9740:9740 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9740:9740 [2] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9748:9748 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9748:9748 [7] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9739:9739 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9739:9739 [1] NCCL INFO Using network AWS Libfabric
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-172-31-13-103:9741:9741 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v3 symbol.
ip-172-31-13-103:9741:9741 [3] NCCL INFO Using network AWS Libfabric

ip-172-31-13-103:9742:9800 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9738:9797 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9923:9981 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9927:9985 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9921:9986 [0] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9739:9804 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9744:9798 [6] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9748:9801 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9741:9803 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9743:9799 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-13-103:9740:9802 [2] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9928:9984 [7] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9926:9982 [5] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9925:9987 [4] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9924:9983 [3] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order


ip-172-31-6-104:9922:9980 [1] graph/search.cc:765 NCCL WARN Could not find a path for pattern 4, falling back to simple order

ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
ip-172-31-13-103:9738:9797 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9738:9797 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->9|9->0->1/-1/-1
ip-172-31-13-103:9738:9797 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-6-104:9925:9987 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9925:9987 [4] NCCL INFO Trees [0] 13/-1/-1->12->11|11->12->13/-1/-1 [1] 13/-1/-1->12->11|11->12->13/-1/-1
ip-172-31-6-104:9926:9982 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9926:9982 [5] NCCL INFO Trees [0] 14/-1/-1->13->12|12->13->14/-1/-1 [1] 14/-1/-1->13->12|12->13->14/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9739:9804 [1] NCCL INFO Trees [0] 2/8/-1->1->0|0->1->2/8/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9924:9983 [3] NCCL INFO Trees [0] 12/-1/-1->11->10|10->11->12/-1/-1 [1] 12/-1/-1->11->10|10->11->12/-1/-1
ip-172-31-13-103:9739:9804 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9923:9981 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9923:9981 [2] NCCL INFO Trees [0] 11/-1/-1->10->9|9->10->11/-1/-1 [1] 11/-1/-1->10->9|9->10->11/-1/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9922:9980 [1] NCCL INFO Trees [0] 10/-1/-1->9->8|8->9->10/-1/-1 [1] 10/0/-1->9->8|8->9->10/0/-1
ip-172-31-6-104:9922:9980 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9928:9984 [7] NCCL INFO Trees [0] -1/-1/-1->15->14|14->15->-1/-1/-1 [1] -1/-1/-1->15->14|14->15->-1/-1/-1
ip-172-31-6-104:9923:9981 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-6-104:9927:9985 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9927:9985 [6] NCCL INFO Trees [0] 15/-1/-1->14->13|13->14->15/-1/-1 [1] 15/-1/-1->14->13|13->14->15/-1/-1
ip-172-31-6-104:9927:9985 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9925:9987 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9740:9802 [2] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9740:9802 [2] NCCL INFO Trees [0] 3/-1/-1->2->1|1->2->3/-1/-1 [1] 3/-1/-1->2->1|1->2->3/-1/-1
ip-172-31-6-104:9924:9983 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-13-103:9740:9802 [2] NCCL INFO Setting affinity for GPU 2 to ff,ffff0000,00ffffff
ip-172-31-13-103:9741:9803 [3] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9741:9803 [3] NCCL INFO Trees [0] 4/-1/-1->3->2|2->3->4/-1/-1 [1] 4/-1/-1->3->2|2->3->4/-1/-1
ip-172-31-13-103:9741:9803 [3] NCCL INFO Setting affinity for GPU 3 to ff,ffff0000,00ffffff
ip-172-31-6-104:9928:9984 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9926:9982 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9742:9800 [4] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9742:9800 [4] NCCL INFO Trees [0] 5/-1/-1->4->3|3->4->5/-1/-1 [1] 5/-1/-1->4->3|3->4->5/-1/-1
ip-172-31-13-103:9742:9800 [4] NCCL INFO Setting affinity for GPU 4 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9743:9799 [5] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9743:9799 [5] NCCL INFO Trees [0] 6/-1/-1->5->4|4->5->6/-1/-1 [1] 6/-1/-1->5->4|4->5->6/-1/-1
ip-172-31-13-103:9743:9799 [5] NCCL INFO Setting affinity for GPU 5 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9744:9798 [6] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9744:9798 [6] NCCL INFO Trees [0] 7/-1/-1->6->5|5->6->7/-1/-1 [1] 7/-1/-1->6->5|5->6->7/-1/-1
ip-172-31-13-103:9744:9798 [6] NCCL INFO Setting affinity for GPU 6 to ffffff00,0000ffff,ff000000
ip-172-31-13-103:9748:9801 [7] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-13-103:9748:9801 [7] NCCL INFO Trees [0] -1/-1/-1->7->6|6->7->-1/-1/-1 [1] -1/-1/-1->7->6|6->7->-1/-1/-1
ip-172-31-13-103:9748:9801 [7] NCCL INFO Setting affinity for GPU 7 to ffffff00,0000ffff,ff000000
ip-172-31-6-104:9921:9986 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 8/8/64
ip-172-31-6-104:9921:9986 [0] NCCL INFO Trees [0] 9/-1/-1->8->1|1->8->9/-1/-1 [1] 9/-1/-1->8->-1|-1->8->9/-1/-1
ip-172-31-6-104:9921:9986 [0] NCCL INFO Setting affinity for GPU 0 to ff,ffff0000,00ffffff
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 00 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 00 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 00 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 00 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 00 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 00 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 00 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 00 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 00 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 00 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO Channel 01 : 4[901c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO Channel 01 : 12[901c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 00 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 3[201d0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9741:9803 [3] NCCL INFO Channel 01 : 3[201d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 00 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 11[201d0] via P2P/IPC/read
ip-172-31-6-104:9924:9983 [3] NCCL INFO Channel 01 : 11[201d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 00 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 00 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 7[a01d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 15[a01d0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9926:9982 [5] NCCL INFO Channel 01 : 13[901d0] -> 12[901c0] via P2P/IPC/read
ip-172-31-13-103:9743:9799 [5] NCCL INFO Channel 01 : 5[901d0] -> 4[901c0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO Channel 01 : 14[a01c0] -> 13[901d0] via P2P/IPC/read
ip-172-31-13-103:9742:9800 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9742:9800 [4] NCCL INFO comm 0x7f2ac8000dc0 rank 4 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO Channel 01 : 6[a01c0] -> 5[901d0] via P2P/IPC/read
ip-172-31-6-104:9925:9987 [4] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9925:9987 [4] NCCL INFO comm 0x7f8da8000dc0 rank 12 nranks 16 cudaDev 4 busId 901c0 - Init COMPLETE
ip-172-31-13-103:9743:9799 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9743:9799 [5] NCCL INFO comm 0x7f73a4000dc0 rank 5 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-6-104:9926:9982 [5] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9926:9982 [5] NCCL INFO comm 0x7f4328000dc0 rank 13 nranks 16 cudaDev 5 busId 901d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 15[a01d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 8[101c0] -> 1[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 10[201c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO Channel 01 : 15[a01d0] -> 14[a01c0] via P2P/IPC/read
ip-172-31-6-104:9928:9984 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9928:9984 [7] NCCL INFO comm 0x7f3a98000dc0 rank 15 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO Channel 01 : 10[201c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-6-104:9927:9985 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9927:9985 [6] NCCL INFO comm 0x7fa984000dc0 rank 14 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9924:9983 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9924:9983 [3] NCCL INFO comm 0x7f3bb8000dc0 rank 11 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 00 : 1[101d0] -> 8[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 2[201c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 7[a01d0] -> 8[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9921:9986 [0] NCCL INFO Channel 01 : 8[101c0] -> 9[101d0] via P2P/IPC/read
ip-172-31-13-103:9740:9802 [2] NCCL INFO Channel 01 : 2[201c0] -> 1[101d0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO Channel 01 : 1[101d0] -> 0[101c0] via P2P/IPC/read
ip-172-31-13-103:9741:9803 [3] NCCL INFO comm 0x7f6c4c000dc0 rank 3 nranks 16 cudaDev 3 busId 201d0 - Init COMPLETE
ip-172-31-6-104:9923:9981 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9923:9981 [2] NCCL INFO comm 0x7f5a24000dc0 rank 10 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9740:9802 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9740:9802 [2] NCCL INFO comm 0x7f36c8000dc0 rank 2 nranks 16 cudaDev 2 busId 201c0 - Init COMPLETE
ip-172-31-13-103:9739:9804 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9739:9804 [1] NCCL INFO comm 0x7fed38000dc0 rank 1 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [receive] via NET/AWS Libfabric/0
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 0[101c0] -> 9[101d0] [send] via NET/AWS Libfabric/0
ip-172-31-13-103:9748:9801 [7] NCCL INFO Channel 01 : 7[a01d0] -> 6[a01c0] via P2P/IPC/read
ip-172-31-13-103:9748:9801 [7] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9748:9801 [7] NCCL INFO comm 0x7fbd04000dc0 rank 7 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
ip-172-31-13-103:9744:9798 [6] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9744:9798 [6] NCCL INFO comm 0x7f23b8000dc0 rank 6 nranks 16 cudaDev 6 busId a01c0 - Init COMPLETE
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 8[101c0] via P2P/IPC/read
ip-172-31-6-104:9921:9986 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9921:9986 [0] NCCL INFO comm 0x7fdc3c000dc0 rank 8 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [receive] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO Channel 01 : 9[101d0] -> 0[101c0] [send] via NET/AWS Libfabric/0
ip-172-31-6-104:9922:9980 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-6-104:9922:9980 [1] NCCL INFO comm 0x7f17c8000dc0 rank 9 nranks 16 cudaDev 1 busId 101d0 - Init COMPLETE
ip-172-31-13-103:9738:9797 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:9738:9797 [0] NCCL INFO comm 0x7faad8000dc0 rank 0 nranks 16 cudaDev 0 busId 101c0 - Init COMPLETE
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-172-31-13-103:9738:9738 [0] NCCL INFO Launch mode Parallel
           8             2   float     sum    79.04    0.00    0.00  4e-07    64.55    0.00    0.00  4e-07
          16             4   float     sum    63.60    0.00    0.00  4e-07    63.44    0.00    0.00  2e-07
          32             8   float     sum    64.62    0.00    0.00  2e-07    65.10    0.00    0.00  1e-07
          64            16   float     sum    65.01    0.00    0.00  1e-07    64.24    0.00    0.00  1e-07
         128            32   float     sum    65.24    0.00    0.00  1e-07    64.54    0.00    0.00  1e-07
         256            64   float     sum    65.57    0.00    0.01  1e-07    64.45    0.00    0.01  1e-07
         512           128   float     sum    66.94    0.01    0.01  1e-07    66.50    0.01    0.01  1e-07
        1024           256   float     sum    68.47    0.01    0.03  4e-07    68.17    0.02    0.03  4e-07
        2048           512   float     sum    72.68    0.03    0.05  4e-07    72.91    0.03    0.05  4e-07
        4096          1024   float     sum    78.42    0.05    0.10  4e-07    77.77    0.05    0.10  4e-07
        8192          2048   float     sum    83.65    0.10    0.18  4e-07    81.94    0.10    0.19  4e-07
       16384          4096   float     sum    96.39    0.17    0.32  4e-07    93.38    0.18    0.33  4e-07
       32768          8192   float     sum    116.7    0.28    0.53  4e-07    114.8    0.29    0.54  4e-07
       65536         16384   float     sum    155.3    0.42    0.79  4e-07    153.4    0.43    0.80  4e-07
      131072         32768   float     sum    203.0    0.65    1.21  4e-07    203.7    0.64    1.21  4e-07
      262144         65536   float     sum    315.6    0.83    1.56  4e-07    311.2    0.84    1.58  4e-07
      524288        131072   float     sum    409.9    1.28    2.40  4e-07    407.1    1.29    2.41  4e-07
     1048576        262144   float     sum    597.2    1.76    3.29  4e-07    594.6    1.76    3.31  4e-07
     2097152        524288   float     sum    926.9    2.26    4.24  4e-07    924.9    2.27    4.25  4e-07
     4194304       1048576   float     sum   1583.5    2.65    4.97  4e-07   1584.1    2.65    4.96  4e-07
     8388608       2097152   float     sum   2939.5    2.85    5.35  4e-07   2929.9    2.86    5.37  4e-07
    16777216       4194304   float     sum   5366.1    3.13    5.86  4e-07   5381.9    3.12    5.85  4e-07
    33554432       8388608   float     sum    10305    3.26    6.10  4e-07    10294    3.26    6.11  4e-07
    67108864      16777216   float     sum    20358    3.30    6.18  4e-07    20341    3.30    6.19  4e-07
   134217728      33554432   float     sum    39328    3.41    6.40  4e-07    39392    3.41    6.39  4e-07
   268435456      67108864   float     sum    77210    3.48    6.52  4e-07    77304    3.47    6.51  4e-07
   536870912     134217728   float     sum   152989    3.51    6.58  4e-07   152798    3.51    6.59  4e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 2.32362 

How does ofi_iflush() work?

Hi. I'm a code monkey on the PSM3 team and I'm investigating an internal problem report and trying to understand if it's a real issue or not. It relates to ofi_iflush() issuing an RDMA read to, presumably, somehow force all outstanding I/Os to complete. It seems pretty obvious that the data being read is not, itself, important, and I'm wondering how adding another I/O to the queue guarantees that all outstanding I/Os complete?

Could you shed some light on this for me, please? Does NCCL assume that when the RDMA read completes all prior I/Os have also completed?
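
For context, the usual pattern here (a hedged sketch, not the plugin's actual code) is to post a small RDMA read through the NIC from the just-written GPU buffer into a host bounce buffer; because the device orders the read behind the earlier writes, the read's completion implies the prior writes have landed. All parameters are assumed to be set up by the caller:

#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_rma.h>

/* Hedged sketch: flush prior RDMA writes to a GPU buffer by reading one
 * word back into host memory. The payload itself is irrelevant; only the
 * completion ordering matters. "addr" can be the endpoint's own address
 * when the flush is a loopback read. */
static int flush_with_rdma_read(struct fid_ep *ep, void *host_buf,
                                void *host_desc, fi_addr_t addr,
                                uint64_t gpu_addr, uint64_t gpu_key,
                                void *context)
{
    ssize_t rc;
    do {
        rc = fi_read(ep, host_buf, sizeof(uint64_t), host_desc,
                     addr, gpu_addr, gpu_key, context);
    } while (rc == -FI_EAGAIN);   /* retry while the send queue is full */
    return (int)rc;
}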

Restore PSM3 transport for libfabric in .travis.yml

Hello aws_ofi_nccl maintainers,

I noticed that the PSM3 provider was removed from the libfabric build in your Travis config file, probably due to #85.
As that issue has already been resolved, this should no longer be needed; could you please revisit it and re-enable PSM3 if possible? A sketch of what that might look like follows.
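
A hedged sketch of the relevant build step (the surrounding .travis.yml structure is an assumption; --enable-psm3 is libfabric's standard configure switch):

# Hypothetical libfabric build step with PSM3 restored.
./autogen.sh
./configure --prefix="$HOME/libfabric/install" --enable-psm3
make -j && make install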

BRs,
Denis.

Mellanox and EFA in Docker Image

I'm attempting to assemble a single Docker image that supports both EFA and Mellanox, since we split workloads between different clouds and it's easy to use the wrong image on the wrong cloud. I currently have something like this:

#####################################
# Install EFA and AWS-OFI-NCCL plugin
#####################################

ARG EFA_INSTALLER_VERSION=latest
ARG AWS_OFI_NCCL_VERSION=v1.5.0-aws

ENV LD_LIBRARY_PATH=/usr/local/cuda/extras/CUPTI/lib64:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib:/opt/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH
ENV PATH=/opt/amazon/openmpi/bin/:/opt/amazon/efa/bin:$PATH

RUN if [ -n "$CUDA_VERSION" ] ; then \
        cd /tmp && \
        curl -O https://efa-installer.amazonaws.com/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        tar -xf /tmp/aws-efa-installer-${EFA_INSTALLER_VERSION}.tar.gz && \
        cd aws-efa-installer && \
        apt-get update && \
        ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify && \
        rm -rf /tmp/aws-efa-installer* ; \
    fi

RUN if [ -n "$CUDA_VERSION" ] ; then \
        git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl && \
        cd /opt/aws-ofi-nccl && \
        git checkout ${AWS_OFI_NCCL_VERSION} && \
        ./autogen.sh && \
        ./configure --prefix=/opt/aws-ofi-nccl/install \
            --with-libfabric=/opt/amazon/efa/ \
            --with-cuda=/usr/local/cuda \
            --disable-tests && \
        make && make install ; \
    fi

###################################
# Mellanox OFED driver installation
###################################

ARG MOFED_VERSION

RUN if [ -n "$MOFED_VERSION" ] ; then \
        mkdir -p /tmp/mofed && \
        wget -nv -P /tmp/mofed http://content.mellanox.com/ofed/MLNX_OFED-${MOFED_VERSION}/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64.tgz && \
        tar -zxvf /tmp/mofed/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64.tgz -C /tmp/mofed && \
        /tmp/mofed/MLNX_OFED_LINUX-${MOFED_VERSION}-ubuntu20.04-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force && \
        rm -rf /tmp/mofed ; \
    fi

and I comment out either the Mellanox part or the EFA part depending on which image I want to build. When attempting to build both at the same time, it appears the Mellanox installation wipes out something from EFA, resulting in EFA not working. If I instead install EFA after Mellanox, I encounter the following error:

The following packages have unmet dependencies:
   libibmad5-dbg : Depends: libibmad5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   libibnetdisc-dev : Depends: libibnetdisc5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   libibnetdisc5-dbg : Depends: libibnetdisc5 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   libibumad3-dbg : Depends: libibumad3 (= 43.0-1) but 55mlnx37-1.55103 is to be installed
   librdmacm1-dbg : Depends: librdmacm1 (= 43.0-1) but 55mlnx37-1.55103 is to be installed

I would love to get some insight into the following (a build-time toggle sketch follows the list):

  1. Is this possible?
  2. Is there any documentation or guidance on how to do it?
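
As an aside, the ARG-gated RUN blocks above can be toggled per cloud at build time instead of by commenting sections out. A hedged sketch of the two invocations (the image tags are placeholders, and it assumes CUDA_VERSION is an ARG inherited from earlier in the Dockerfile):

# EFA-only image: set CUDA_VERSION, leave MOFED_VERSION empty.
docker build --build-arg CUDA_VERSION=11.7 -t myimage:efa .

# Mellanox-only image: set MOFED_VERSION, leave CUDA_VERSION empty.
docker build --build-arg MOFED_VERSION=5.4-3.1.0.0 -t myimage:mofed .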

Missing ability to skip memory registration when the provider does not request it

Hello aws-ofi-nccl maintainers,

During our testing with the PSM3 provider, we hit abnormal program termination in the ofi_iflush phase.
The assumption that the provider requires memory registration was reintroduced by recent changes.
Could you please consider merging #97, which fixes this?

We would also like to discuss the following topic: how can PSM3 be added to the continuous integration plan for the aws-ofi-nccl plugin, so that mistakes like this are detected sooner?

Training performance on p3dn.24xlarge on Amazon Linux 2 is worse than on Ubuntu 20.04 (with and without EFA)

Overview of issue

I've observed that on p3dn.24xlarge instances, multi-node PyTorch training jobs using EFA and aws-ofi-nccl have worse performance on AL2 compared to an equivalent setup on Ubuntu 20.04.

AMI                             EFA Enabled    Throughput (wpm)
AL2 Deep Learning Base AMI      Yes            ~65000
AL2 Deep Learning Base AMI      No             ~45000
Ubuntu Deep Learning Base AMI   Yes            ~120000
Ubuntu Deep Learning Base AMI   No             ~110000

The reason I don't have numbers for Ubuntu with EFA is because of #107.

Repro steps

Common Setup

On both AL2 and Ubuntu 20.04 instances, we're using p3dn.24xlarge in the same VPC, in the same placement group (cluster strategy). EFA is enabled on the network interfaces, and I've verified that EFA drivers are installed.

Dockerfile used for training job:

Training data download and pre-processing follow https://github.com/pytorch/fairseq/blob/main/examples/language_model/README.md.

A two-node setup was used for the training job.

AL2 Setup

  • AMI: Deep Learning Base AMI (Amazon Linux 2) Version 52.0, (ami-07f6f7cc742921659 in us-west-2)

    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::58:92ff:fe3f:a7a3
          domain: rdmap0s6-rdm
          version: 114.0
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::58:92ff:fe3f:a7a3
          domain: rdmap0s6-dgrm
          version: 114.0
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA
      
  • nvidia-docker version: 20.10.7

Training command (EFA enabled) (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/mnt/fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env NCCL_SOCKET_IFNAME=eth0 \
   --env FI_PROVIDER=efa \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_manual_docker_al2" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_al2_efa.txt

Note that we're using the --device /dev/infiniband/uverbs0 flag to pass through EFA. For the non-EFA run, we drop the arguments --device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa.

Ubuntu Setup

  • AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403, (ami-061dac75dbd529aef in us-west-2)
    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-rdm
          version: 114.10
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-dgrm
          version: 114.10
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA
      
    • nvidia-docker version: 20.10.14

Training command (EFA enabled) (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env FI_PROVIDER=EFA \
   --env NCCL_SOCKET_IFNAME=ens5 \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   --ulimit memlock=-1 \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   --master_addr=$MASTER_IP --master_port=12345 \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu_efa" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt

Compared to the AL2 run, we add --ulimit memlock=-1 due to #107. Note that adding the same flag to the equivalent AL2 run makes no difference in performance.

For the non-EFA run, we drop the arguments --device /dev/infiniband/uverbs0 --env FI_PROVIDER=efa.

Results

AMI                             EFA Enabled    Throughput (wpm)
AL2 Deep Learning Base AMI      Yes            ~65000
AL2 Deep Learning Base AMI      No             ~45000
Ubuntu Deep Learning Base AMI   Yes            ~120000
Ubuntu Deep Learning Base AMI   No             ~110000

Run Logs

AL2 (with EFA)

Interesting bits

[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO NET/OFI Selected Provider is efa
[0]:ip-10-0-0-175:16:16 [0] NCCL INFO Using network AWS Libfabric
[0]:NCCL version 2.10.3+cuda11.3

Full log: https://gist.github.com/yukunlin/634c600a11e36d1384215ab08366e774

AL2 (no EFA)

Initialization:

[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Bootstrap : Using eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported
[0]:
[0]:ip-10-0-0-175:17:17 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/IB : No device found.
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO NET/Socket : Using [0]eth0:10.0.0.175<0>
[0]:ip-10-0-0-175:17:17 [0] NCCL INFO Using network Socket
[0]:NCCL version 2.10.3+cuda11.3

Full log: https://gist.github.com/yukunlin/8c4298450299b33dd9a4c0559f50eccc

Ubuntu (EFA)

Initialization:

ip-10-0-0-115:74:74 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:74:74 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 4
ip-10-0-0-115:74:74 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-0-0-115:74:74 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3

Full logs: https://gist.github.com/yukunlin/ba8e41131abc1a7e4fb288b480d94b8f

Ubuntu (no EFA)

Initialization:

ip-10-0-0-115:73:73 [0] NCCL INFO Bootstrap : Using ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.2.0aws
ip-10-0-0-115:73:73 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1

ip-10-0-0-115:73:73 [0] ofi_init:1157 NCCL WARN NET/OFI Only EFA provider is supported

ip-10-0-0-115:73:73 [0] ofi_init:1208 NCCL WARN NET/OFI aws-ofi-nccl initialization failed
ip-10-0-0-115:73:73 [0] NCCL INFO NET/IB : No device found.
ip-10-0-0-115:73:73 [0] NCCL INFO NET/Socket : Using [0]ens5:10.0.0.115<0>
ip-10-0-0-115:73:73 [0] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.3

Full log: https://gist.github.com/yukunlin/95a1036dba1c3a677f8f130e6cf23fbf

Can't create p4d instance with multiple EFA network interfaces.

p4d.24xlarge instances support up to four EFAs. https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-amis.

I added 2 EFA network interfaces to a p4d instance, and the launch always fails with an error (see the sketch and log below).
Moreover, we can't use public IPs, because public IPs can only be assigned to instances with one network interface.
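
One likely cause (hedged, based on the EC2 EFA documentation): on p4d.24xlarge each EFA must be placed on its own network card, which the interface shorthand in the log below does not specify. A sketch of a corrected interface list (NetworkCardIndex added, DeviceIndex 1 for the secondary interface, subnet and security-group IDs reused from the log, other arguments elided):

aws ec2 run-instances --instance-type p4d.24xlarge ... \
  --network-interfaces \
  '[{"InterfaceType":"efa","NetworkCardIndex":0,"DeviceIndex":0,"SubnetId":"subnet-0911d87f03ca7d3e5","Groups":["sg-053c797b3233f857f"]},{"InterfaceType":"efa","NetworkCardIndex":1,"DeviceIndex":1,"SubnetId":"subnet-0911d87f03ca7d3e5","Groups":["sg-053c797b3233f857f"]}]'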

2020-12-11 17:14:32,501 f5db57be Thread-28 nccl(1/1) vm_util.py:407 DEBUG    Ran: {aws --output json ec2 run-instances --region=us-east-1 --client-token=72aa7d99-280b-4949-b19c-3ab70a8ad6cd --image-id=ami-0404ddec9491a5a31 --instance-type=p4d.24xlarge --key-name=perfkit-key-f5db57be --tag-specifications=ResourceType=instance,Tags=[{Key=timeout_utc,Value=20201212t021429z},{Key=create_time_utc,Value=20201212t011429z},{Key=benchmark,Value=nccl},{Key=perfkit_uuid,Value=f5db57be-26112013-d3c7-4c94-81c7-1e0d55d1f34a},{Key=owner,Value=tohaowu},{Key=benchmark_uid,Value=nccl0}] --network-interfaces=[{"InterfaceType": "efa", "DeviceIndex": 0, "SubnetId": "subnet-0911d87f03ca7d3e5", "Groups": ["sg-053c797b3233f857f"]}, {"InterfaceType": "efa", "DeviceIndex": 1, "SubnetId": "subnet-02ad934424869f529", "Groups": ["sg-053c797b3233f857f"]}] --block-device-mappings=[{"DeviceName": "/dev/sda1", "Ebs": {"DeleteOnTermination": true, "SnapshotId": "snap-003d530d53f605cf3", "VolumeSize": 105, "VolumeType": "gp2"}}] --placement=AvailabilityZone=us-east-1a,GroupName=perfkit-f5db57be-79b7ca8c4588}
ReturnCode:0,  WallTime:0:02.65s,  CPU:0.39s,  MaxMemory:57420kb
STDOUT: {
    "Groups": [],
    "Instances": [
        {
            "AmiLaunchIndex": 0,
            "ImageId": "ami-0404ddec9491a5a31",
            "InstanceId": "i-08f87c8e0c176485d",
            "InstanceType": "p4d.24xlarge",
            "KeyName": "perfkit-key-f5db57be",
            "LaunchTime": "2020-12-12T01:14:32.000Z",
            "Monitoring": {
                "State": "disabled"
            },
            "Placement": {
                "AvailabilityZone": "us-east-1a",
                "GroupName": "",
                "Tenancy": "default"
            },
            "PrivateDnsName": "ip-10-0-0-71.ec2.internal",
            "PrivateIpAddress": "10.0.0.71",
            "ProductCodes": [],
            "PublicDnsName": "",
            "State": {
                "Code": 0,
                "Name": "pending"
            },
            "StateTransitionReason": "",
            "SubnetId": "subnet-0911d87f03ca7d3e5",
            "VpcId": "vpc-0e77f7b3dc0a69e21",
            "Architecture": "x86_64",
            "BlockDeviceMappings": [],
            "ClientToken": "72aa7d99-280b-4949-b19c-3ab70a8ad6cd",
            "EbsOptimized": false,
            "EnaSupport": true,
            "Hypervisor": "xen",
            "NetworkInterfaces": [
                {
                    "Attachment": {
                        "AttachTime": "2020-12-12T01:14:32.000Z",
                        "AttachmentId": "eni-attach-01ed94ae6b3920106",
                        "DeleteOnTermination": true,
                        "DeviceIndex": 1,
                        "Status": "attaching"
                    },
                    "Description": "",
                    "Groups": [
                        {
                            "GroupName": "default",
                            "GroupId": "sg-053c797b3233f857f"
                        }
                    ],
                    "Ipv6Addresses": [],
                    "MacAddress": "0e:f1:b3:8c:2a:01",
                    "NetworkInterfaceId": "eni-08e48fabb32dd8641",
                    "OwnerId": "835761027970",
                    "PrivateDnsName": "ip-10-0-1-81.ec2.internal",
                    "PrivateIpAddress": "10.0.1.81",
                    "PrivateIpAddresses": [
                        {
                            "Primary": true,
                            "PrivateDnsName": "ip-10-0-1-81.ec2.internal",
                            "PrivateIpAddress": "10.0.1.81"
                        }
                    ],
                    "SourceDestCheck": true,
                    "Status": "in-use",
                    "SubnetId": "subnet-02ad934424869f529",
                    "VpcId": "vpc-0e77f7b3dc0a69e21",
                    "InterfaceType": "efa"
                },
                {
                    "Attachment": {
                        "AttachTime": "2020-12-12T01:14:32.000Z",
                        "AttachmentId": "eni-attach-091873659b0d0ccf3",
                        "DeleteOnTermination": true,
                        "DeviceIndex": 0,
                        "Status": "attaching"
                    },
                    "Description": "",
                    "Groups": [
                        {
                            "GroupName": "default",
                            "GroupId": "sg-053c797b3233f857f"
                        }
                    ],
                    "Ipv6Addresses": [],
                    "MacAddress": "0e:ee:ef:8d:4f:a7",
                    "NetworkInterfaceId": "eni-08536700350aed736",
                    "OwnerId": "835761027970",
                    "PrivateDnsName": "ip-10-0-0-71.ec2.internal",
                    "PrivateIpAddress": "10.0.0.71",
                    "PrivateIpAddresses": [
                        {
                            "Primary": true,
                            "PrivateDnsName": "ip-10-0-0-71.ec2.internal",
                            "PrivateIpAddress": "10.0.0.71"
                        }
                    ],
                    "SourceDestCheck": true,
                    "Status": "in-use",
                    "SubnetId": "subnet-0911d87f03ca7d3e5",
                    "VpcId": "vpc-0e77f7b3dc0a69e21",
                    "InterfaceType": "efa"
                }
            ],
            "RootDeviceName": "/dev/sda1",
            "RootDeviceType": "ebs",
            "SecurityGroups": [
                {
                    "GroupName": "default",
                    "GroupId": "sg-053c797b3233f857f"
                }
            ],
            "SourceDestCheck": true,
            "StateReason": {
                "Code": "pending",
                "Message": "pending"
            },
            "Tags": [
                {
                    "Key": "owner",
                    "Value": "tohaowu"
                },
                {
                    "Key": "benchmark_uid",
                    "Value": "nccl0"
                },
                {
                    "Key": "perfkit_uuid",
                    "Value": "f5db57be-26112013-d3c7-4c94-81c7-1e0d55d1f34a"
                },
                {
                    "Key": "timeout_utc",
                    "Value": "20201212t021429z"
                },
                {
                    "Key": "benchmark",
                    "Value": "nccl"
                },
                {
                    "Key": "create_time_utc",
                    "Value": "20201212t011429z"
                }
            ],
            "VirtualizationType": "hvm",
            "CpuOptions": {
                "CoreCount": 48,
                "ThreadsPerCore": 2
            },
            "CapacityReservationSpecification": {
                "CapacityReservationPreference": "open"
            },
            "MetadataOptions": {
                "State": "pending",
                "HttpTokens": "optional",
                "HttpPutResponseHopLimit": 1,
                "HttpEndpoint": "enabled"
            }
        }
    ],
    "OwnerId": "835761027970",
    "ReservationId": "r-098d70a7ef009aaf4"
}

STDERR:
2020-12-11 17:14:32,502 f5db57be Thread-28 nccl(1/1) vm_util.py:353 INFO     Running: aws --output json ec2 describe-instances --region=us-east-1 --filter=Name=client-token,Values=72aa7d99-280b-4949-b19c-3ab70a8ad6cd
2020-12-11 17:14:33,394 f5db57be Thread-28 nccl(1/1) vm_util.py:407 DEBUG    Ran: {aws --output json ec2 describe-instances --region=us-east-1 --filter=Name=client-token,Values=72aa7d99-280b-4949-b19c-3ab70a8ad6cd}
ReturnCode:0,  WallTime:0:00.87s,  CPU:0.40s,  MaxMemory:56720kb
STDOUT: {
    "Reservations": [
        {
            "Groups": [],
            "Instances": [
                {
                    "AmiLaunchIndex": 0,
                    "ImageId": "ami-0404ddec9491a5a31",
                    "InstanceId": "i-08f87c8e0c176485d",
                    "InstanceType": "p4d.24xlarge",
                    "KeyName": "perfkit-key-f5db57be",
                    "LaunchTime": "2020-12-12T01:14:32.000Z",
                    "Monitoring": {
                        "State": "disabled"
                    },
                    "Placement": {
                        "AvailabilityZone": "us-east-1a",
                        "GroupName": "",
                        "Tenancy": "default"
                    },
                    "PrivateDnsName": "",
                    "ProductCodes": [],
                    "PublicDnsName": "",
                    "State": {
                        "Code": 32,
                        "Name": "shutting-down"
                    },
                    "StateTransitionReason": "Server.InternalError",
                    "Architecture": "x86_64",
                    "BlockDeviceMappings": [],
                    "ClientToken": "72aa7d99-280b-4949-b19c-3ab70a8ad6cd",
                    "EbsOptimized": false,
                    "EnaSupport": true,
                    "Hypervisor": "xen",
                    "NetworkInterfaces": [],
                    "RootDeviceName": "/dev/sda1",
                    "RootDeviceType": "ebs",
                    "SecurityGroups": [],
                    "StateReason": {
                        "Code": "Server.InternalError",
                        "Message": "Server.InternalError: Internal error on launch"
                    },
                    "Tags": [
                        {
                            "Key": "owner",
                            "Value": "tohaowu"
                        },
                        {
                            "Key": "benchmark_uid",
                            "Value": "nccl0"
                        },
                        {
                            "Key": "perfkit_uuid",
                            "Value": "f5db57be-26112013-d3c7-4c94-81c7-1e0d55d1f34a"
                        },
                        {
                            "Key": "timeout_utc",
                            "Value": "20201212t021429z"
                        },
                        {
                            "Key": "benchmark",
                            "Value": "nccl"
                        },
                        {
                            "Key": "create_time_utc",
                            "Value": "20201212t011429z"
                        }
                    ],
                    "VirtualizationType": "hvm",
                    "CpuOptions": {
                        "CoreCount": 48,
                        "ThreadsPerCore": 2
                    },
                    "CapacityReservationSpecification": {
                        "CapacityReservationPreference": "open"
                    },
                    "HibernationOptions": {
                        "Configured": false
                    },
                    "MetadataOptions": {
                        "State": "pending",
                        "HttpTokens": "optional",
                        "HttpPutResponseHopLimit": 1,
                        "HttpEndpoint": "enabled"
                    }
                }
            ],
            "OwnerId": "835761027970",
            "ReservationId": "r-098d70a7ef009aaf4"
        }
    ]
}

./configure does not recognize home directory "~/"

If I pass --nccl-path=~/nccl/build, the make command complains that 'nccl_net.h' is not found. I later found that I need to pass the complete path, /home/ubuntu/nccl/build. Automatically recognizing ~ would make installation easier for others.
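For illustration only, a minimal sketch of the expansion being requested (configure itself is shell, so the real fix would live in the shell/m4 logic; expand_tilde is a hypothetical helper):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Sketch: substitute $HOME for a leading "~" before using the path. */
static char *expand_tilde(const char *path)
{
    if (path[0] != '~' || (path[1] != '/' && path[1] != '\0'))
        return strdup(path);            /* nothing to expand */
    const char *home = getenv("HOME");
    if (home == NULL)
        return strdup(path);            /* no $HOME: leave as-is */
    size_t n = strlen(home) + strlen(path + 1) + 1;
    char *out = malloc(n);
    if (out != NULL)
        snprintf(out, n, "%s%s", home, path + 1);
    return out;
}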

Handling GDR-capable providers not requested for memory registration

Hello aws_ofi_nccl maintainers,

For GDR-capable providers that do not request memory registration (i.e., they provide FI_HMEM but not FI_MR_HMEM or FI_MR_LOCAL), there is an issue in the current implementation.

The function register_mr_buffers() leaves mr_handle untouched, and since the caller initializes it to NULL, mr_handle remains NULL.
Later, during ofi_iflush(), a NULL mr_handle is treated as an error condition, which leads to returning ncclSystemError. But a NULL handle is normal here if the provider did not ask for memory registration.
Later, during ofi_closeRecv(), mr_handle is passed to fi_close(); since it is NULL, this leads to a segfault.
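For illustration, a minimal sketch of the kind of NULL guard needed (names follow the issue text; this is not the actual diff in the PR):

#include <stddef.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Sketch: deregister a memory region only if the provider actually
 * created one. A NULL handle is the normal case for FI_HMEM providers
 * that request neither FI_MR_HMEM nor FI_MR_LOCAL, so it must not be
 * treated as an error. */
static int dereg_mr_if_needed(struct fid_mr *mr_handle)
{
    if (mr_handle == NULL)
        return 0;                      /* nothing was registered */
    return fi_close(&mr_handle->fid);  /* 0 on success, negative error code on failure */
}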

To fix this the following pull request was created #81
Please consider merging it.

BRs,
Denis

Unable to force FI_HMEM to be used and FI_OPT_CUDA_API_PERMITTED is not respected by config scripts

  1. Setting FI_OPT_CUDA_API_PERMITTED to false or 0 doesn't seem to make a difference to the config scripts; they still behave as if FI_OPT_CUDA_API_PERMITTED is set to 1.
  2. I am unable to force usage of the FI_HMEM implementation of my libfabric provider.

I would greatly appreciate some clarity about how FI_OPT_CUDA_API_PERMITTED is used, its relationship to FI_HMEM, and its relationship to GPUDirect/GDRCopy, if any.
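As far as I understand, FI_OPT_CUDA_API_PERMITTED is a per-endpoint libfabric option set programmatically via fi_setopt(), not an environment variable, which may be why exporting it makes no difference. A minimal sketch, assuming an already-created endpoint ep:

#include <stdbool.h>
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Sketch: ask the provider never to call CUDA APIs on behalf of the
 * application. A return of -FI_EOPNOTSUPP or -FI_ENOPROTOOPT means the
 * provider (or the libfabric build) does not support the option. */
static int forbid_cuda_calls(struct fid_ep *ep)
{
    bool permitted = false;
    return fi_setopt(&ep->fid, FI_OPT_ENDPOINT, FI_OPT_CUDA_API_PERMITTED,
                     &permitted, sizeof(permitted));
}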

2 node MPI hang on nccl_message_transfer with EFA 1.5 and Ubuntu 18.04

Launch Command:
/opt/amazon/openmpi/bin/mpirun -np 2 --host 172.0.1.23,172.0.1.161 -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH /usr/local/bin/nccl_message_transfer

Error:

INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa
INFO: Function: main Line: 49: NET/OFI Process rank 0 started. NCCLNet device used on ip-172-0-1-23 is AWS Libfabric.
INFO: Function: main Line: 53: NET/OFI Received 1 network devices
INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0
INFO: Function: ofi_init Line: 686: NET/OFI Selected Provider is efa
INFO: Function: main Line: 49: NET/OFI Process rank 1 started. NCCLNet device used on ip-172-0-1-161 is AWS Libfabric.
INFO: Function: main Line: 53: NET/OFI Received 1 network devices
INFO: Function: main Line: 57: NET/OFI Server: Listening on dev 0
WARN: Function: create_nccl_ofi_component Line: 459: NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
WARN: Function: create_nccl_ofi_component Line: 459: NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument

MPI hangs with no other output; top shows nccl_message_transfer at 100% CPU usage on both nodes.
Additional errors appear in dmesg:

[ 1632.420956] infiniband efa_0: Failed to process command ALLOC_PD (opcode 14) comp_status 1 err -12
[ 1632.420957] infiniband efa_0: Failed to allocate pd[-12]
[ 1632.726850] infiniband efa_0: Failed to allocate pd[-12]

Environment:
Ubuntu 18.04
NVIDIA 430.26 CUDA 10.0
EFA 1.5

PATH=/opt/amazon/openmpi/bin:/opt/amazon/efa/bin/:/usr/local/cuda:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
LD_LIBRARY_PATH=/opt/nccl/lib:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:

EFA enabled:

ubuntu@ip-172-0-1-23:~$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::ca:e4ff:fec9:7534
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::ca:e4ff:fec9:7534
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::ca:e4ff:fec9:7534
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

[7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument

I'm using horovod with EFA, and the multi-node job hangs with

...
[1,26]<stdout>:ip-172-32-36-209:8716:9558 [2] NCCL INFO Ring 00 : 26[2] -> 30[6] via P2P/IPC
[1,25]<stdout>:ip-172-32-36-209:8715:9546 [1] NCCL INFO Ring 00 : 25[1] -> 27[3] via P2P/IPC
[1,29]<stdout>:ip-172-32-36-209:8719:9555 [5] NCCL INFO Ring 00 : 29[5] -> 31[7] via P2P/IPC
[1,28]<stdout>:ip-172-32-36-209:8718:9563 [4] NCCL INFO Ring 00 : 28[4] -> 29[5] via P2P/IPC
[1,27]<stdout>:ip-172-32-36-209:8717:9550 [3] NCCL INFO Ring 00 : 27[3] -> 26[2] via P2P/IPC
[1,7]<stdout>:ip-172-32-38-1:21471:22316 [7] NCCL INFO Ring 00 : 7 -> 8 [send] via NET/AWS Libfabric/0
[1,15]<stdout>:ip-172-32-34-121:70560:71394 [7] NCCL INFO Ring 00 : 15 -> 16 [send] via NET/AWS Libfabric/0
[1,30]<stdout>:ip-172-32-36-209:8720:9567 [6] NCCL INFO Ring 00 : 30[6] -> 28[4] via P2P/IPC
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO Ring 00 : 23 -> 24 [send] via NET/AWS Libfabric/0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO NET/OFI No NIC info for dev 0
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO include/net.h:24 -> 2
[1,23]<stdout>:
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] create_nccl_ofi_component:459 NCCL WARN NET/OFI Couldn't open AV. RC: -22, ERROR: Invalid argument
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO include/net.h:27 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO transport/net.cc:357 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:668 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:814 -> 2
[1,23]<stdout>:ip-172-32-45-165:72593:73427 [7] NCCL INFO init.cc:950 -> 2
[1,31]<stdout>:ip-172-32-36-209:8800:9549 [7] NCCL INFO Ring 00 : 31 -> 0 [send] via NET/AWS Libfabric/0
...
ubuntu@ip-172-32-38-1:~$ fi_info -p efa
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::7e:9eff:fed3:c48a
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

Support FI_CONTEXT2

Hey folks, we at Cornelis Networks are interested in adopting this ofi-nccl plugin.

If we create a patch to support fi_context2 structures, would that be something you are willing to entertain?
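For concreteness, a minimal sketch of what such a patch would typically touch: a provider that sets FI_CONTEXT2 in info->mode must be handed a context with fi_context2-sized scratch space for every operation (plugin_request is a hypothetical stand-in for the plugin's per-request struct).

#include <rdma/fabric.h>

/* Sketch: embedding fi_context2 (void *internal[8]) instead of
 * fi_context (void *internal[4]) satisfies providers requiring either
 * FI_CONTEXT or FI_CONTEXT2. */
struct plugin_request {
    struct fi_context2 ctx;  /* scratch space owned by the provider */
    int dev;                 /* ...plugin-specific state follows... */
};
/* Operations then pass &req->ctx as the libfabric context argument. */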

Thanks!

try horovod: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory

Hello, I'm trying to test Horovod with EFA+NCCL, but it hangs when testing with multiple nodes. I think the main error is: create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory.

[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,5]<stdout>:ip-172-31-6-189:155:696 [5] NCCL INFO Ring 01 : 5[5] -> 6[6] via P2P/IPC
[1,3]<stdout>:ip-172-31-6-189:153:691 [3] NCCL INFO Ring 01 : 3 -> 10 [send] via NET/AWS Libfabric/1
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 01 : 6[6] -> 4[4] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 01 : 7[7] -> 3[3] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 01 : 4[4] -> 7[7] via P2P/IPC
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO Ring 02 : 13 -> 4 [receive] via NET/AWS Libfabric/2
[1,6]<stdout>:ip-172-31-6-189:156:695 [6] NCCL INFO Ring 02 : 6[6] -> 7[7] via P2P/IPC
[1,7]<stdout>:ip-172-31-6-189:157:687 [7] NCCL INFO Ring 02 : 7[7] -> 5[5] via P2P/IPC
[1,4]<stdout>:
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO include/net.h:21 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO transport/net.cc:334 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:340 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:650 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:815 -> 2
[1,4]<stdout>:ip-172-31-6-189:154:697 [4] NCCL INFO init.cc:951 -> 2
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 01 : 12[4] -> 15[7] via P2P/IPC
[1,14]<stdout>:ip-172-31-3-127:43:574 [6] NCCL INFO Ring 01 : 14[6] -> 12[4] via P2P/IPC
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:17.0 path /sys/devices/pci0000:00/
[1,11]<stdout>:ip-172-31-3-127:40:580 [3] NCCL INFO Ring 01 : 11 -> 2 [send] via NET/AWS Libfabric/1
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO NET/OFI [4] getCudaPath dev 2 busId 0000:00:18.0 path /sys/devices/pci0000:00/
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO Ring 02 : 5 -> 12 [receive] via NET/AWS Libfabric/2
[1,13]<stdout>:ip-172-31-3-127:42:576 [5] NCCL INFO Ring 01 : 13[5] -> 14[6] via P2P/IPC
[1,12]<stdout>:
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] create_nccl_ofi_component:496 NCCL WARN NET/OFI Couldn't enable endpoint. RC: -12, ERROR: Cannot allocate memory
[1,12]<stdout>:ip-172-31-3-127:41:573 [4] NCCL INFO include/net.h:21 -> 2

Some information which may be helpful:

I use EFA 1.5.1, and fi_info -p efa works:

provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

 EFA installer version: 1.5.1
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-bin_1.8.0amzn1.0_amd64 libfabric-dev_1.8.0amzn1.0_amd64 libfabric1_1.8.0amzn1.0_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi_3.1.4-2_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-dbg_1.8.0amzn1.0_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64

I also tested NCCL all_reduce_perf and it works as well. To run it:

curl http://169.254.169.254/latest/meta-data/local-ipv4 >> my-hosts &&
/opt/amazon/openmpi/bin/mpirun \
-x FI_PROVIDER=efa \
-x FI_EFA_TX_MIN_CREDITS=64 \
-x NCCL_DEBUG=INFO \
-x NCCL_TREE_THRESHOLD=0 \
--hostfile my-hosts -n 8 -N 8 \
--mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
/opt/build/nccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1 -c 1 -n 100

I get:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_nccl-efa-test.log

For Horovod, my command is:

NCCL_DEBUG=INFO \
HOROVOD_NUM_NCCL_STREAMS=4 \
horovodrun -np 16 -H localhost:8,172.31.3.127:8 \
--mpi-args="-x PATH -x LD_LIBRARY_PATH -x FI_PROVIDER=efa -x FI_EFA_TX_MIN_CREDITS=64 -x NCCL_TREE_THRESHOLD=0" python3 /home/cluster/distributed-training/test_scripts/pytorch_synthetic_benchmark.py --model resnet101 --batch-size 32 |& grep -v "Read -1"

the complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.5.1_horovod-test.log

PS: In fact, I would prefer to use EFA 1.8.3 (to maintain the same test environment), but I got more errors with this version:

[1,0]<stderr>:terminate called after throwing an instance of 'std::system_error'
[1,0]<stderr>:  what():  Resource deadlock avoided
[1,0]<stderr>:[ip-172-31-6-189:00789] *** Process received signal ***
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal: Aborted (6)
[1,0]<stderr>:[ip-172-31-6-189:00789] Signal code:  (-6)
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 0] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f17f5b34f20]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 1] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f17f5b34e97]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 2] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f17f5b36801]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 3] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8c957)[0x7f17f0d40957]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 4] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x92ab6)[0x7f17f0d46ab6]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 5] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x91b19)[0x7f17f0d45b19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 6] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a8)[0x7f17f0d46488]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 7] /lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10613)[0x7f17f0aac613]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[ 8] /lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x2b1)[0x7f17f0aacb71]
[1,0]<stderr>:[ip-172-31-6-189:00789] [ 9] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x37)[0x7f17f0d46d17]
[1,0]<stderr>:[ip-172-31-6-189:00789] [10] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x8ea19)[0x7f17f0d42a19]
[1,0]<stderr>:[ip-172-31-6-189:00789] [11] [1,0]<stderr>:/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd8dc)[0x7f17f0d718dc]
[1,0]<stderr>:[ip-172-31-6-189:00789] [12] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xaa8)[0x7f17c47176b8]
[1,0]<stderr>:[ip-172-31-6-189:00789] [13] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x43041)[0x7f17f5b39041]
[1,0]<stderr>:[ip-172-31-6-189:00789] [14] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(+0x4313a)[0x7f17f5b3913a]
[1,0]<stderr>:[ip-172-31-6-189:00789] [15] /opt/amazon/efa/lib/libfabric.so.1(+0x5ebbf)[0x7f16dbd7ebbf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[16] /opt/amazon/efa/lib/libfabric.so.1(+0xf3f2)[0x7f16dbd2f3f2]
[1,0]<stderr>:[ip-172-31-6-189:00789] [17] /opt/amazon/efa/lib/libfabric.so.1(fi_getinfo+0x45d)[0x7f16dbd2fa9d]
[1,0]<stderr>:[ip-172-31-6-189:00789] [18] /usr/local/lib/libnccl-net.so(+0x2045)[0x7f16e41cc045]
[1,0]<stderr>:[ip-172-31-6-189:00789] [1,0]<stderr>:[19] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1065)[0x7f17c47a2065]
[1,0]<stderr>:[ip-172-31-6-189:00789] [20] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0xf1b4f)[0x7f17c47a2b4f]
[1,0]<stderr>:[ip-172-31-6-189:00789] [21] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x245)[0x7f17c47633a5]
[1,0]<stderr>:[ip-172-31-6-189:00789] [22] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x54)[0x7f17c47634d4]
[1,0]<stderr>:[ip-172-31-6-189:00789] [23] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x71)[0x7f17c472e061]
[1,0]<stderr>:[ip-172-31-6-189:00789] [24] /usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0xa1)[0x7f17c472e371]
[1,0]<stderr>:[ip-172-31-6-189:00789] [25] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/horovod-0.18.2-py3.6-linux-x86_64.egg/horovod/torch/mpi_lib_v2.cpython-36m-x86_64-linux-gnu.so(+0x5e440)[0x7f17c470f440]
[1,0]<stderr>:[ip-172-31-6-189:00789] [26] [1,0]<stderr>:/usr/local/lib/python3.6/dist-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0xaecf)[0x7f17f2000ecf]
[1,0]<stderr>:[ip-172-31-6-189:00789] [27] /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f17f58de6db]
[1,0]<stderr>:[ip-172-31-6-189:00789] [28] [1,0]<stderr>:/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f17f5c1788f]
[1,0]<stderr>:[ip-172-31-6-189:00789] *** End of error message ***

complete log:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_horovod-test.log

and the nccl all_reduce_perf test also works:
https://github.com/handar423/aws-efa-horovod-log/blob/master/1.8.3_nccl-efa-test.log

and fi_info -p efa, cat /opt/amazon/efa_installed_packages:

provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-rdm
    version: 2.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 2.0
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::4a:78ff:fef8:fa03
    domain: efa_0-dgrm
    version: 1.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

SHM transfer will be disabled because of ptrace protection.
To enable SHM transfer, please refer to the man page fi_efa.7 for more information.
Also note that turning off ptrace protection has security implications. If you cannot
turn it off, you can suppress this message by setting FI_EFA_ENABLE_SHM_TRANSFER=0

 EFA installer version: 1.8.3
# Debug packages installed: yes
# Packages installed:
ibacm_25.0-1_amd64 ibverbs-providers_25.0-1_amd64 ibverbs-utils_25.0-1_amd64 infiniband-diags_25.0-1_amd64 libfabric-aws-bin_1.9.0amzn1.1_amd64 libfabric-aws-dev_1.9.0amzn1.1_amd64 libfabric1-aws_1.9.0amzn1.1_amd64 libibmad-dev_25.0-1_amd64 libibmad5_25.0-1_amd64 libibnetdisc-dev_25.0-1_amd64 libibnetdisc5_25.0-1_amd64 libibumad-dev_25.0-1_amd64 libibumad3_25.0-1_amd64 libibverbs-dev_25.0-1_amd64 libibverbs1_25.0-1_amd64 librdmacm-dev_25.0-1_amd64 librdmacm1_25.0-1_amd64 openmpi40-aws_4.0.2-1_amd64 rdma-core_25.0-1_amd64 rdmacm-utils_25.0-1_amd64 libfabric1-aws-dbg_1.9.0amzn1.1_amd64 libibmad5-dbg_25.0-1_amd64 libibnetdisc5-dbg_25.0-1_amd64 libibumad3-dbg_25.0-1_amd64 libibverbs1-dbg_25.0-1_amd64 librdmacm1-dbg_25.0-1_amd64

Could someone please give me some suggestions about the right direction? Thank you very much!

What are some AI/ML workloads users can utilize to test performance of the plugin?

Hey AWS team,

At Cornelis Networks we have had good luck so far with the plugin. We are able to run all of NVIDIA's nccl performance tests with the plugin and our OPX libfabric provider!

We want to start running some real PyTorch/TensorFlow workloads and assess performance for some 'real-world' applications. I was hoping you could point me toward some apps/workloads that you folks use for performance benchmarking :) I noticed in #240 that someone mentioned the 'PyTorch-FSDP' workload; more examples similar to that would be greatly appreciated.

Thanks again for accepting our patches! Also, if there is a more appropriate forum for general questions like this (email, slack, etc), please let me know.

[Question] Is RDMA available on p3dn instances?

Looking at the docs and having run some test jobs on p3dn instances, it looks like GPU RDMA over EFA is not available on p3dn instances, only on p4d. Is this correct?

Some observations on p3dn (running a fairseq - pytorch - training job using the nccl backend):

  1. NCCL_ALGO=tree is faster than NCCL_ALGO=ring (assuming no RDMA on p3dn, I'm thinking this is due to two copies: DtoH -> HtoH (inter-node, over EFA) -> HtoD)
  2. With NCCL_DEBUG=INFO we're observing debug logs like:
     [0]:ip-10-0-0-157:24:146 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0
    
     # -- NOT --
     [0]:ip-10-0-0-157:24:146 [0] NCCL INFO Channel 00 : 12[1a0] -> 0[160] [receive] via NET/AWS Libfabric/0/GDRDMA
    
  3. nv_peer_mem is not loaded:
     $ lsmod | grep nv
     root@ip-10-0-0-159:/# lsmod | grep nv
     nvidia_drm             61440  0
     nvidia_modeset       1196032  1 nvidia_drm
     nvidia_uvm           1138688  16
     nvidia              35262464  2858 nvidia_uvm,nvidia_modeset
     drm_kms_helper        184320  1 nvidia_drm
     drm                   425984  4 drm_kms_helper,nvidia,nvidia_drm
     i2c_core               77824  3 drm_kms_helper,nvidia,drm
    

NCCL WARN NET/OFI Only EFA provider is supported

Hi,

Apologies in advance if this is not the right place to ask.

I am trying to run PyTorch DDP with NCCL backend on SageMaker. I have my own Docker image which uses the following as a base
763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.13.1-gpu-py38-cu117-ubuntu20.04-sagemaker

This is configured as a SageMaker Training job through the AWS Console (I am working on someone else's setup, so no full control/understanding of the underlying details).

The instance is ml.p3.16xlarge with 8 V100 GPUs. I don't need node-to-node communication, just communication between the GPUs on a single node. I have no issues running my code on an EC2 instance (ml.p3.16xlarge, but not using the Docker image directly).

Running my job, I see the following warnings in the logs, and then the job seems to hang.

algo-1:249:249 [0] ofi_init:1288 NCCL WARN NET/OFI Only EFA provider is supported
algo-1:249:249 [0] ofi_init:1339 NCCL WARN NET/OFI aws-ofi-nccl initialization failed

Could someone help me figure out the source of the issue here? I am not too familiar with EFA, but digging through the documentation here, I see it's only supported on p3dn.24xlarge and p4d.24xlarge instances, which is not what I need. Is this a configuration issue with the container? Why is only the EFA provider supported through NCCL?

Any pointers would be really appreciated.

Performance degradation for buffer size starting from ~46MB

Hi there,

I am running nccl-tests on two p4d instances, and I noticed the performance drops significantly at a 64 MB buffer size, as shown in the following log:

#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-172-31-13-103:7699:7699 [0] NCCL INFO Launch mode Parallel
     8388608       2097152   float     sum   1321.5    6.35   11.90  5e-07   1202.2    6.98   13.08  5e-07
    16777216       4194304   float     sum   1896.7    8.85   16.59  5e-07   1872.0    8.96   16.80  5e-07
    33554432       8388608   float     sum   1931.6   17.37   32.57  5e-07   1963.7   17.09   32.04  5e-07
    67108864      16777216   float     sum   6586.3   10.19   19.10  5e-07   6591.5   10.18   19.09  5e-07
   134217728      33554432   float     sum   8189.9   16.39   30.73  5e-07   8157.7   16.45   30.85  5e-07
   268435456      67108864   float     sum    11812   22.72   42.61  5e-07    11919   22.52   42.23  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 25.6327 
#

So I used a fixed step size of 2 MB and experimented with buffer sizes from 32 MB to 100 MB.
As shown in the log, the bus bandwidth drops to about 19 GB/s starting from 46 MB (48234496 bytes).
This is interesting. Any idea why this happens?

#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
ip-172-31-13-103:7852:7852 [0] NCCL INFO Launch mode Parallel
ip-172-31-13-103:7862:7914 [7] NCCL INFO 8 coll channels, 8 p2p channels, 1 p2p channels per peer
ip-172-31-13-103:7862:7914 [7] NCCL INFO comm 0x7f8aec000dc0 rank 7 nranks 16 cudaDev 7 busId a01d0 - Init COMPLETE
    33554432       8388608   float     sum   1926.3   17.42   32.66  5e-07   1887.0   17.78   33.34  5e-07
    35651584       8912896   float     sum   2040.2   17.47   32.76  5e-07   1984.1   17.97   33.69  5e-07
    37748736       9437184   float     sum   2203.1   17.13   32.13  5e-07   2059.0   18.33   34.38  5e-07
    39845888       9961472   float     sum   2237.6   17.81   33.39  5e-07   2222.5   17.93   33.62  5e-07
    41943040      10485760   float     sum   2339.2   17.93   33.62  5e-07   2327.6   18.02   33.79  5e-07
    44040192      11010048   float     sum   2408.5   18.28   34.28  5e-07   2420.2   18.20   34.12  5e-07
    46137344      11534336   float     sum   2521.8   18.30   34.30  5e-07   2515.4   18.34   34.39  5e-07
    48234496      12058624   float     sum   4953.8    9.74   18.26  5e-07   4861.2    9.92   18.60  5e-07
    50331648      12582912   float     sum   5067.2    9.93   18.62  5e-07   5134.0    9.80   18.38  5e-07
    52428800      13107200   float     sum   5293.5    9.90   18.57  5e-07   5254.6    9.98   18.71  5e-07
    54525952      13631488   float     sum   5424.3   10.05   18.85  5e-07   5457.3    9.99   18.73  5e-07
    56623104      14155776   float     sum   5651.5   10.02   18.79  5e-07   5613.9   10.09   18.91  5e-07
    58720256      14680064   float     sum   5805.5   10.11   18.96  5e-07   5835.1   10.06   18.87  5e-07
    60817408      15204352   float     sum   6018.0   10.11   18.95  5e-07   6011.0   10.12   18.97  5e-07
    62914560      15728640   float     sum   6218.5   10.12   18.97  5e-07   6259.8   10.05   18.84  5e-07
    65011712      16252928   float     sum   6467.3   10.05   18.85  5e-07   6437.6   10.10   18.94  5e-07
    67108864      16777216   float     sum   6579.8   10.20   19.12  5e-07   6583.1   10.19   19.11  5e-07
    69206016      17301504   float     sum   6817.4   10.15   19.03  5e-07   6832.4   10.13   18.99  5e-07
    71303168      17825792   float     sum   7018.6   10.16   19.05  5e-07   7090.8   10.06   18.85  5e-07
    73400320      18350080   float     sum   7231.0   10.15   19.03  5e-07   7199.5   10.20   19.12  5e-07
    75497472      18874368   float     sum   7402.5   10.20   19.12  5e-07   7369.1   10.25   19.21  5e-07
    77594624      19398656   float     sum   7564.9   10.26   19.23  5e-07   7607.6   10.20   19.12  5e-07
    79691776      19922944   float     sum   7791.7   10.23   19.18  5e-07   7788.8   10.23   19.18  5e-07
    81788928      20447232   float     sum   7958.2   10.28   19.27  5e-07   7949.7   10.29   19.29  5e-07
    83886080      20971520   float     sum   8144.1   10.30   19.31  5e-07   8171.6   10.27   19.25  5e-07
    85983232      21495808   float     sum   8342.2   10.31   19.33  5e-07   8385.3   10.25   19.23  5e-07
    88080384      22020096   float     sum   8601.3   10.24   19.20  5e-07   8600.0   10.24   19.20  5e-07
    90177536      22544384   float     sum   8814.6   10.23   19.18  5e-07   8802.5   10.24   19.21  5e-07
    92274688      23068672   float     sum   8982.6   10.27   19.26  5e-07   9002.8   10.25   19.22  5e-07
    94371840      23592960   float     sum   9199.3   10.26   19.23  5e-07   9166.0   10.30   19.30  5e-07
    96468992      24117248   float     sum   9360.9   10.31   19.32  5e-07   9420.4   10.24   19.20  5e-07
    98566144      24641536   float     sum   9589.1   10.28   19.27  5e-07   9586.6   10.28   19.28  5e-07
   100663296      25165824   float     sum   9769.6   10.30   19.32  5e-07   9762.1   10.31   19.33  5e-07
   102760448      25690112   float     sum   9997.1   10.28   19.27  5e-07   9980.3   10.30   19.31  5e-07
   104857600      26214400   float     sum    10162   10.32   19.35  5e-07    10208   10.27   19.26  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 21.9716 
#

Couldn't allocate endpoint. RC: -22, ERROR: Invalid argument

I have this error and want to know how to solve it.

$ mpirun -n 2 --host node13,node14 ./nccl_message_transfer
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
TRACE: Function: main Line: 58: NET/OFI Using CUDA device 0 for memory allocation
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1006: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
TRACE: Function: find_ofi_provider Line: 525: NET/OFI Could not find any optimal provider supporting GPUDirect RDMA
INFO: Function: ofi_init Line: 1033: NET/OFI Selected Provider is psm2
INFO: Function: main Line: 69: NET/OFI Process rank 0 started. NCCLNet device used on node13 is AWS Libfabric.
INFO: Function: main Line: 73: NET/OFI Received 1 network devices
INFO: Function: ofi_pciPath Line: 1094: NET/OFI No NIC info for dev 0
INFO: Function: ofi_getProperties Line: 1194: NET/OFI No NIC info for dev 0. Supplying default values for NIC properties.
TRACE: Function: print_dev_props Line: 78: NET/OFI ****************** Device 0 Properties ******************
TRACE: Function: print_dev_props Line: 79: NET/OFI hfi1_0;hfi1_1: PCIe Path: (null)
TRACE: Function: print_dev_props Line: 80: NET/OFI hfi1_0;hfi1_1: Plugin Support: 1
TRACE: Function: print_dev_props Line: 81: NET/OFI hfi1_0;hfi1_1: Device GUID: 0
TRACE: Function: print_dev_props Line: 82: NET/OFI hfi1_0;hfi1_1: Device Speed: 0
TRACE: Function: print_dev_props Line: 83: NET/OFI hfi1_0;hfi1_1: Device Port: 1
TRACE: Function: print_dev_props Line: 84: NET/OFI hfi1_0;hfi1_1: Device Maximum Communicators: 65535
TRACE: Function: main Line: 104: NET/OFI Rank 0 uses 0 device for communication
INFO: Function: main Line: 114: NET/OFI Server: Listening on dev 0
WARN: Function: create_nccl_ofi_component Line: 708: NET/OFI Couldn't allocate endpoint. RC: -22, ERROR: Invalid argument
WARN: Function: main Line: 115: NET/OFI OFI NCCL failure: 2

$ fi_info -p psm2
provider: psm2
    fabric: psm2
    domain: psm2
    version: 1.6
    type: FI_EP_RDM
    protocol: FI_PROTO_PSMX2
provider: psm2
    fabric: psm2
    domain: psm2
    version: 1.6
    type: FI_EP_RDM
    protocol: FI_PROTO_PSMX2
provider: psm2
    fabric: psm2
    domain: psm2
    version: 1.6
    type: FI_EP_RDM
    protocol: FI_PROTO_PSMX2

Plugin fails if compiled against Libfabric 1.18 but run against Libfabric 1.17 or older.

Hello,

Using a fresh deployment of the Ubuntu 22.04 AMI on p4d.24xlarge instances and installing the CUDA stack on bare metal, followed by EFA, using these commands:

curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz
tar -xf aws-efa-installer-latest.tar.gz
sudo ./aws-efa-installer/efa_installer.sh -y

Then running a container image that was built with those commands in its Dockerfile:

...
RUN cd /root \
    && curl -O https://efa-installer.amazonaws.com/aws-efa-installer-latest.tar.gz \
    && tar -xf /root/aws-efa-installer-latest.tar.gz \
    && cd aws-efa-installer \
    && apt-get update \
    && apt-get install -y libhwloc-dev \
    && ./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify \
    && apt-get install -y libfabric-bin \
    && rm -rf /var/lib/apt/lists/*

RUN git clone https://github.com/aws/aws-ofi-nccl.git /opt/aws-ofi-nccl \
    && cd /opt/aws-ofi-nccl \
    && git checkout v1.7.1-aws \
    && ./autogen.sh \
    && ./configure --prefix=/opt/aws-ofi-nccl/ \
       --with-libfabric=/opt/amazon/efa/ \
       --with-cuda=/usr/local/cuda \
    && make && make install
...

We get this error running the NCCL tests between 2 containers (one on each instance):

configure_sendrecv_inorder:213 NCCL WARN NET/OFI Couldn't set FI_OPT_EFA_SENDRECV_IN_ORDER_ALIGNED_128_BYTES. RC: -92, ERROR: Protocol not available
nccl_net_ofi_init:1163 NCCL WARN NET/OFI aws-ofi-nccl initialization failed

Running fi_info -p efa -t FI_EP_RDM we get:

provider: efa
    fabric: EFA-fe80::8d3:d9ff:fe9a:6059
    domain: rdmap197s0-rdm
    version: 111.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa;ofi_rxd
    fabric: EFA-fe80::8d3:d9ff:fe9a:6059
    domain: rdmap197s0-dgrm
    version: 111.0
    type: FI_EP_RDM
    protocol: FI_PROTO_RXD

^^ We see 2 provider sections per EFA adapter (this example is for 1 adapter per instance).

WORKAROUND:
To solve this issue, we need to reinstall efa on the containers with:

./efa_installer.sh -y -k --uninstall
./efa_installer.sh -y -g -d --skip-kmod --skip-limit-conf --no-verify

Then the NCCL tests will work fine with EFA and for some reason fi_info -p efa -t FI_EP_RDM will return a single provider section per EFA adapter:

provider: efa
    fabric: EFA-fe80::8d3:d9ff:fe9a:6059
    domain: rdmap197s0-rdm
    version: 111.0
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA

Do you know why 2 libfabric provider sections show up in the broken scenario and why we need to re-install EFA on the containers after running them?
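For reference, a minimal illustrative program comparing the compile-time and run-time libfabric versions, which is the mismatch this report describes (built against 1.18 headers, loading an older libfabric.so at run time):

#include <stdio.h>
#include <rdma/fabric.h>

/* Sketch: compare the version the plugin was compiled against
 * (FI_MAJOR_VERSION/FI_MINOR_VERSION from the headers) with the
 * version of the libfabric.so actually loaded (fi_version()). */
int main(void)
{
    uint32_t rt = fi_version();
    printf("built against %u.%u, running against %u.%u\n",
           (unsigned)FI_MAJOR_VERSION, (unsigned)FI_MINOR_VERSION,
           (unsigned)FI_MAJOR(rt), (unsigned)FI_MINOR(rt));
    if (!FI_VERSION_GE(rt, FI_VERSION(1, 18)))
        printf("options introduced in 1.18, such as the in-order "
               "send/recv option above, will fail at run time\n");
    return 0;
}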

[Feature Request]Allow custom NCCL_TOPO_FILE location

Currently, on the aws branch, OFI uses the --prefix location given during configure to store the XML topology files; it sets XML_DIR to the prefix value at compile time: https://github.com/aws/aws-ofi-nccl/blob/aws/src/Makefile.am#L8

At runtime it sets NCCL_TOPO_FILE according to the XML_DIR value: https://github.com/aws/aws-ofi-nccl/blob/aws/src/nccl_ofi_net.c#L1178

This creates a problem when I build the OFI plugin library in one place (a CI system), package it, and try to run it on another machine (e.g., via conda install to a $CONDA_PREFIX).

If the build path (--prefix) and $CONDA_PREFIX do not match, OFI fails to find the topology file, resulting in a regression.

Could you update the logic to use the NCCL_TOPO_FILE value if the environment variable is already set? This would allow me to configure OFI to read XML files at any location. Please also add logging about which topology file is used.
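A minimal sketch of the requested check (set_topo_file and topo_file_path are hypothetical names; the real code computes the path from XML_DIR):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: respect a user-provided NCCL_TOPO_FILE instead of always
 * overriding it with the compile-time XML_DIR location. */
static void set_topo_file(const char *topo_file_path)
{
    const char *preset = getenv("NCCL_TOPO_FILE");
    if (preset != NULL) {
        fprintf(stderr, "Using preset NCCL_TOPO_FILE: %s\n", preset);
        return;  /* do not override the user's choice */
    }
    setenv("NCCL_TOPO_FILE", topo_file_path, 0);
    fprintf(stderr, "Using NCCL_TOPO_FILE: %s\n", topo_file_path);
}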

I'm able to work around this issue by using Conda's binary relocation patch, which changes the $PREFIX value during installation. But this would still be a good feature to have.

Thanks!

Error (and crash) when using EFA from docker running on ubuntu AMI

Overview of issue

I have a docker image with EFA and aws-ofi-nccl installed. This image "works" with EFA when running on an AL2 AMI (but is slow, see #106). However, when the same image is run on an ubuntu AMI, we get an error message:

[0]:Unable to write to EQ: Missing or unavailable event queue. err: Unknown error -12 (-12) prov_errno: Unknown error -12 (-12) prov/efa/src/rxr/rxr.h:993

Repro Steps

Ubuntu Setup

  • Instance type: p3dn.24xlarge (EFA enabled)
  • AMI: AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) 20220403, (ami-061dac75dbd529aef in us-west-2)
    • Nvidia driver version: 510.47.03

    • CUDA version: 11.6 (shouldn't be a factor because we're running the job from docker)

    • /opt/amazon/efa/bin/fi_info -p efa:

      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-rdm
          version: 114.10
          type: FI_EP_RDM
          protocol: FI_PROTO_EFA
      provider: efa
          fabric: EFA-fe80::da:b9ff:fe04:8af
          domain: rdmap0s6-dgrm
          version: 114.10
          type: FI_EP_DGRAM
          protocol: FI_PROTO_EFA
      
    • nvidia-docker version: 20.10.14

Training command (executed on both training nodes):

nvidia-docker run \
   --mount type=bind,src=/home/ubuntu/ps-fsx,dst=/job \
   --network host \
   --device /dev/infiniband/uverbs0 \
   --env FI_PROVIDER=EFA \
   --env NCCL_SOCKET_IFNAME=ens5 \
   --env LOGLEVEL=INFO \
   --env NCCL_PROTO=simple \
   --env NCCL_DEBUG=INFO \
   919560170281.dkr.ecr.us-west-2.amazonaws.com/yukunlin-test-snap:manual_docker5 \
   python -m torch.distributed.launch --rdzv_backend c10d --rdzv_endpoint $MASTER_IP:29500 --rdzv_id 'foobar' --nnodes 2 --nproc_per_node 8 --role '' \
   --master_addr=$MASTER_IP --master_port=12345 \
   fairseq_train_wrapped \
   --task language_modeling \
  /job/fairseq/data-bin/wikitext-103 \
  --save-dir "/job/fairseq/checkpoints/transformer_wikitext-103_ubuntu" \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 \
  --max-update 50000 2>&1 | tee ~/output_ubuntu_efa.txt

Full log: https://gist.github.com/yukunlin/dd5e5ee6e41f84a696e76b74e75c65d0

Other Observations

This seems related to #44. I did follow #44 (comment) and ran nccl-test on the AMI successfully, which indicates EFA is working (on the AMI at least).

Note that the docker image works (it doesn't crash) when running from an AL2 AMI (see #106)

DataLoader crash when using FI_EFA_USE_DEVICE_RDMA=1

Our AWS p4d.24xlarge job passed on 08/24, with a throughput of 3511 samples/second.
We used two p4d.24xlarge instances with FI_PROVIDER="efa" and FI_EFA_USE_DEVICE_RDMA=1.

This test failed recently. The error message follows:

File "/workspace/bert/run_pretraining.py", line 1592, in
args, final_loss, train_time_raw = main()
File "/workspace/bert/run_pretraining.py", line 1344, in main
for step, batch in enumerate(train_dataloader):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 356, in iter
return self._get_iterator()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 302, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 941, in init
self._reset(loader, first_iter=True)
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 972, in _reset
self._try_put_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1206, in _try_put_index
index = self._next_index()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 509, in _next_index
return next(self._sampler_iter) # may raise StopIteration
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 226, in iter
for idx in self.sampler:
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 124, in iter
yield from torch.randperm(n, generator=generator).tolist()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 827) is killed by signal: Segmentation fault.

When testing without FI_EFA_USE_DEVICE_RDMA=1, the test passes, but the throughput is only 1673 samples/sec.

This is the dockerfile we used.
https://github.com/aws-samples/aws-efa-nccl-baseami-pipeline/blob/master/nvidia-efa-docker_base/Dockerfile.base

PyTorch Distributed Training crashes with "Cannot allocate memory (-12)"

Hi All,

I am running PyTorch distributed training (code here) on 4 AWS A100 nodes with EFA. We got an error of

Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno:
 Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683

when launching the experiments. Using Ethernet gives correct results, so we think the issue happens in EFA. Could you please take a look at it?

Here are the library versions:

PyTorch 1.12.0
CUDA: 11.6
NCCL: 2.10.3
aws-ofi-nccl: 1.3.0
libfabric: libfabric.so.1.19.0
cuda-drivers-fabricmanager-510
cuda-drivers-510

Full log before crashes:

ip-10-216-179-193:964326:964326 [0] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964326:964326 [0] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964326:964326 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.6
ip-10-216-179-193:964328:964328 [2] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964332:964332 [6] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964331:964331 [5] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964327:964327 [1] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964330:964330 [4] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964329:964329 [3] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964333:964333 [7] NCCL INFO Bootstrap : Using ens32:10.216.179.193<0>
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Running on p4d.24xlarge platform, Setting NCCL_TOPO_FILE environment variable to /sensei-fs/users/hatan/libs/aws-ofi-nccl/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
ip-10-216-179-193:964328:964328 [2] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964328:964328 [2] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964331:964331 [5] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964331:964331 [5] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964330:964330 [4] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964330:964330 [4] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964332:964332 [6] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964332:964332 [6] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964327:964327 [1] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964327:964327 [1] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964329:964329 [3] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964329:964329 [3] NCCL INFO Using network AWS Libfabric
ip-10-216-179-193:964333:964333 [7] NCCL INFO NET/OFI Selected Provider is efa
ip-10-216-179-193:964333:964333 [7] NCCL INFO Using network AWS Libfabric
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Unable to write to EQ: Missing or unavailable event queue. err: Cannot allocate memory (-12) prov_errno: Cannot allocate memory (-12) ./prov/efa/src/rxr/rxr.h:683
Traceback (most recent call last):
  File "main.py", line 177, in <module>
    devjobs_multi_node_main(args)
  File "main.py", line 77, in devjobs_multi_node_main
    mp.spawn(main, nprocs=args.gpus, args=(args,))
  File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py",
 line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py",
 line 198, in start_processes
    while not context.join():
  File "/sensei-fs/users/hatan/libs/env_efa/lib/python3.8/site-packages/torch/multiprocessing/spawn.py",
 line 140, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGABRT

Let me know if any other information or tests would help. Thanks!

Best,
Hao

Unable to find libcudart.so (1.7.1)

When running nccl-tests, I see the following error:

nccl-tests-worker-0:39:45 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.7.1-aws
nccl-tests-worker-0:39:45 [0] nccl_net_ofi_cuda_init:39 NCCL WARN NET/OFI Failed to find CUDA Runtime library: libcudart.so: cannot open shared object file: No such file or directory

The base image nvidia/cuda:12-runtime-ubuntu:22.04 only includes the file /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12, not libcudart.so. Adding the symlink below works around the issue, but it would be good if this worked out of the box, e.g., by also looking for libcudart.so.*, if that is possible.

RUN ln -s /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so.12 /usr/local/cuda-12.0/targets/x86_64-linux/lib/libcudart.so
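A minimal sketch of the suggested fallback (open_cudart is a hypothetical helper; the version list is illustrative, not what the plugin currently probes):

#include <dlfcn.h>
#include <stddef.h>

/* Sketch: try the unversioned name first (present in dev images), then
 * fall back to known SONAMEs shipped by runtime-only images. */
static void *open_cudart(void)
{
    const char *names[] = { "libcudart.so", "libcudart.so.12",
                            "libcudart.so.11.0", NULL };
    for (size_t i = 0; names[i] != NULL; i++) {
        void *handle = dlopen(names[i], RTLD_NOW | RTLD_GLOBAL);
        if (handle != NULL)
            return handle;
    }
    return NULL;  /* caller can report dlerror() as it does today */
}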

rdma/fabric.h not found

I've installed the EFA driver successfully on the AWS EC2 Deep Learning AMI. I've also built NCCL 2.4.2. When building aws-ofi-nccl, I get the following error during configure:

[ec2-user@ip-10-0-48-207 aws-ofi-nccl]$ ./configure --with-cuda=/usr/local/cuda/ --with-nccl=$NCCL_HOME --with-mpi=/opt/amazon/efa/bin/
checking whether make supports nested variables... yes
checking for gcc... gcc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking whether make sets $(MAKE)... yes
checking for ar... ar
checking the archiver (ar) interface... ar
checking how to run the C preprocessor... gcc -E
checking for grep that handles long lines and -e... /bin/grep
checking for egrep... /bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for stdbool.h that conforms to C99... yes
checking for _Bool... yes
checking for inline... inline
checking for size_t... yes
checking for ssize_t... yes
checking for uint64_t... yes
checking for stdlib.h... (cached) yes
checking for GNU libc compatible malloc... yes
checking for memset... yes
checking for realpath... yes
checking limits.h usability... yes
checking limits.h presence... yes
checking for limits.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for unistd.h... (cached) yes
checking rdma/fabric.h usability... no
checking rdma/fabric.h presence... no
checking for rdma/fabric.h... no
configure: error: unable to find required headers
[ec2-user@ip-10-0-48-207 aws-ofi-nccl]$ make -j 32 NCCL_HOME=$NCCL_HOME
make: *** No targets specified and no makefile found.  Stop.

I have the libfabric source, but if I make and install it, it overwrites the EFA libfabric. Any ideas?

Running nccl-perf tests documentation is missing MPI instructions

I ran into a bit of a headache because I wasn't referring to the nccl-tests documentation when attempting to build the performance tests.

All I needed was to add MPI=1 MPI_HOME=/path/to/mpi and things worked well :)

Do you folks think it's reasonable to add this to the README.md? If so, I'll open a PR to close this out.

Something like:

2. Build the tests

cd nccl-tests/
make NCCL_HOME=~/nccl/build MPI=1 MPI_HOME=$MPI_HOME

Error running NCCL Tests via MPIJob: "prov_err Not a Directory"

Running the NCCL tests on image public.ecr.aws/w6p6i9i7/aws-efa-nccl-rdma:22.03-pt-py3 results in the following errors.

[1,13]<stdout>:nccl-tests-worker-1:35:89 [5] ofi_process_cq:1039 NCCL WARN NET/OFI Request 0x7fbb944701b8 completed with error. RC: 20. Error: unknown error. Completed length: 0, Request: { buffer_index: 255, dev: 1, size: 0, state: CREATED, direction: SEND } [1,13]<stderr>:libfabric:35:1668017255::efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 20, prov_err: Not a directory (20)

Full error log: https://gist.github.com/Csinclair0/48ee4f1388d4901e6958069ee272a305
MPIJob spec : https://gist.github.com/Csinclair0/18dcc46fb3ae98c189b9c5dcd56ead2f

Deadlock for rendezvous providers

Hello aws_ofi_nccl maintainers,

NCCL, during its GDR capability check, calls ofi_listen, ofi_connect, and ofi_accept sequentially.
ofi_connect issues a non-blocking fi_tsend to send a small buffer of data to the recipient.
It then tries to ensure that the data was actually transferred.
The underlying provider detects that the caller wants to talk to itself.
If the provider implements a rendezvous mode for self-communication, the transfer is considered complete only once the caller has actually received the buffer, i.e., ofi_accept would have to be called before ofi_connect returns. At the same time, the accept is considered complete only if the data is already there when accept is called. Deadlock.

To fix this the following pull request was created #84
Please consider merging it.

BRs,
Denis

Potential use after free in ofi_connect()

The fi_tsend() call in ofi_connect() uses a local stack variable (local_ep_addr) as the send buffer. This buffer must remain valid until the completion for this send has been reaped; however, the function exits without ensuring this. This could result in a corrupted connect message if the memory is reused and overwritten before the provider has sent the data.

Either an inject-style send should be used, or the local_ep_addr needs to live on the heap somewhere.
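For illustration, minimal sketches of both options (names like ep, dest, and tag are placeholders, not the plugin's actual symbols):

#include <stdlib.h>
#include <string.h>
#include <rdma/fabric.h>
#include <rdma/fi_tagged.h>

/* Option 1: inject-style send. libfabric copies the buffer before
 * fi_tinject() returns, so a stack buffer is safe. Only valid for
 * payloads up to the provider's inject_size. */
static int send_addr_inject(struct fid_ep *ep, fi_addr_t dest, uint64_t tag,
                            const void *addr, size_t len)
{
    return (int)fi_tinject(ep, addr, len, dest, tag);
}

/* Option 2: heap-allocate the buffer so it stays valid until the send
 * completion is reaped from the CQ; the CQ handler then frees it. */
static int send_addr_heap(struct fid_ep *ep, fi_addr_t dest, uint64_t tag,
                          const void *addr, size_t len)
{
    void *buf = malloc(len);
    if (buf == NULL)
        return -1;
    memcpy(buf, addr, len);
    /* pass buf as the operation context so the CQ handler can free it */
    return (int)fi_tsend(ep, buf, len, NULL, dest, tag, buf);
}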

could not find header file nccl_net.h when installing from downloaded release

[ec2-user@ip-172-31-23-78 aws-ofi-nccl-0.9]$ make
mpicc -I/opt/nccl/include -I/usr/local/cuda/include -I/opt/libfabric/include -I./include/ -o /home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring -L/home/ec2-user/oss/aws-ofi-nccl-0.9 -lnccl-net 		\
-L/opt/libfabric/lib -lfabric -L/usr/local/cuda/lib64 -lcudart 	\
/home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring.c -ldl
In file included from /home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring.c:5:0:
/home/ec2-user/oss/aws-ofi-nccl-0.9/tests/test-common.h:11:10: fatal error: nccl_net.h: No such file or directory
 #include <nccl_net.h>
          ^~~~~~~~~~~~
compilation terminated.
make: *** [/home/ec2-user/oss/aws-ofi-nccl-0.9/tests/ring] Error 1
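(nccl_net.h ships in the NCCL source tree rather than with the plugin, and some NCCL builds do not copy it into the installed include directory. One hedged workaround, assuming an NCCL checkout under /opt/nccl with the header at src/include/nccl_net.h, is to copy it next to the headers the compile line above already searches:

cp /opt/nccl/src/include/nccl_net.h /opt/nccl/include/)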

WARNING: unrecognized options: --with-nccl when attempting to install

I am attempting to build off an existing Dockerfile to add EFA multinode support for some runs I would like to do on AWS. My dockerfile can be found here. I am looking at this guide and this example dockerfile.

I would like to avoid rebuilding pytorch from source, so I am instead trying to clone the same version of nccl and then build the aws-ofi-nccl plugin pointing to that version of nccl. While doing this, I am hitting the following error:

configure: WARNING: unrecognized options: --with-nccl

on the following command:

./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-nccl=/opt/nccl/build --with-mpi=/opt/amazon/openmpi/

It appears the latest on the aws branch no longer has this flag. Are there updated instructions on how to install this?
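(On branches where configure no longer recognizes --with-nccl, simply dropping that flag appears to be the intent, since the plugin now bundles the NCCL net-plugin headers it needs; the remaining flags are unchanged:

./configure --prefix=/opt/aws-ofi-nccl/install --with-libfabric=/opt/amazon/efa/ --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi/)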

Crash with multirail providers.

Hello aws-ofi-nccl maintainers,

If the underlying provider (PSM3 in our case) supports multi-rail, it typically exposes a virtual device with a NULL name. The plugin blindly strdup()s this name, and in the NULL case a segfault happens.
#101 fixes this.
Could you please consider merging it?

BRs,
Denis.
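For context, a minimal sketch of the kind of guard such a fix needs (the field shown, info->domain_attr->name, and the surrounding names are assumptions for illustration, not the actual patch in #101):

/* a virtual multi-rail device may report a NULL name;
 * strdup(NULL) is undefined behaviour and typically segfaults */
const char *name = info->domain_attr->name;
dev->name = strdup(name ? name : "unknown");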

Clarification needed for NCCL setting recommendation

Hi Team,

Step 8 of this EFA guidance seems to recommend the following NCCL settings:

  • NCCL_ALGO=ring — enables ring algorithm for collective operations.

  • NCCL_PROTO=simple — instructs NCCL to use a simple protocol for communication. Currently, the EFA provider does not support LL protocols. Enabling them could lead to data corruption.
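(Concretely, these are plain environment variables, e.g. exported before launch or passed to mpirun with -x NCCL_ALGO=ring -x NCCL_PROTO=simple.)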

Are they still the recommended settings today? In this recent issue, the recommendation seems to have changed to NOT changing any NCCL defaults: #65 (comment)

Any clarification would be much appreciated!

Specifically, does EFA support LL or LL128 today? Or do users still need to set the protocol to Simple to avoid data corruption?

Thank you!

Cc @rohan-varma @zhaojuanmao @pbelevich

Throughput decrease

I tried EFA 1.7.0 early this year with the NCCL-tests all_reduce benchmark, and the max out-of-place algbw was 104 Gbps.
With EFA 1.7.1, the max out-of-place algbw is 65 Gbps.

Would you help me check it?

EFA is not enabled in P4DN after upgrading to NCCL v2.10.3 and PyTorch v1.10 (master)

On 2 P4DN nodes, I upgraded NCCL to v2.10.3 and PyTorch to v1.10. However, EFA is not enabled even though all related libraries (PyTorch, aws-ofi-nccl, nccl-tests) load the same NCCL library (located at /usr/lib/x86_64-linux-gnu).

The error logs from aws-ofi-nccl/tests and nccl_tests are as follows.

/opt/amazon/openmpi/bin/mpirun \
         -n 16 -N 8 --hostfile /job/hostfile \
         -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/lib:/deepspeed/$USER/aws-ofi-nccl/install/lib:/home/$USER/aws-ofi-nccl:$LD_LIBRARY_PATH \
         -x FI_PROVIDER="efa" --mca btl_tcp_if_exclude lo,docker0 --bind-to none \
         /home/deepspeed/aws-ofi-nccl/tests/nccl_message_transfer
10.3.35.83: + /opt/amazon/openmpi/bin/mpirun -n 16 -N 8 --hostfile /job/hostfile -x LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/usr/local/cuda/lib64:/usr/local/cuda:/usr/local/cuda/lib:/deepspeed/deepspeed/aws-ofi-nccl/install/lib:/home/deepspeed/aws-ofi-nccl:/usr/local/cuda-11.0/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/lib:/usr/local/lib:/usr/local/cuda-11.0/lib64:/opt/amazon/efa/lib64:/opt/amazon/openmpi/lib64:/usr/lib:/usr/local/lib::/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64 -x FI_PROVIDER=efa --mca btl_tcp_if_exclude lo,docker0 --bind-to none /home/deepspeed/aws-ofi-nccl/tests/nccl_message_transfer
10.3.35.83: Warning: Permanently added '[10.3.60.130]:2022' (RSA) to the list of known hosts.
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: Primary job  terminated normally, but 1 process returned
10.3.35.83: a non-zero exit code. Per user-direction, the job has been aborted.
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: INFO: Function: ofi_init Line: 1076: NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: INFO: Function: ofi_init Line: 1102: NET/OFI Setting RDMAV_FORK_SAFE environment variable to 1.
10.3.35.83: WARN: Function: find_ofi_provider Line: 543: NET/OFI Couldn't find any optimal provider
10.3.35.83: WARN: Function: main Line: 66: NET/OFI OFI NCCL failure: 2
10.3.35.83: --------------------------------------------------------------------------
10.3.35.83: mpirun detected that one or more processes exited with non-zero status, thus causing
10.3.35.83: the job to be terminated. The first process to do so was:
10.3.35.83: 
10.3.35.83:   Process name: [[43564,1],7]
10.3.35.83:   Exit code:    2
10.3.35.83: --------------------------------------------------------------------------
cd /fsx/hchaoyan/home/m5/nccl-tests && \
sudo rm -rf build && \
make MPI=1 MPI_HOME=/opt/amazon/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/usr/lib/x86_64-linux-gnu

echo "running P4DN test"
cd /fsx/hchaoyan/home/m5/nccl-tests
$(which mpirun) -allow-run-as-root --mca plm_rsh_no_tree_spawn 1 \
-x FI_PROVIDER="efa" \
-x NCCL_SOCKET_IFNAME=eth \
-x FI_EFA_USE_DEVICE_RDMA=1 \
-x RDMAV_FORK_SAFE=1 \
-x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/opt/amazon/efa/lib:/opt/amazon/openmpi/lib:/home/$USER/aws-ofi-nccl/install/lib:$LD_LIBRARY_PATH \
-x NCCL_DEBUG=INFO \
-x NCCL_MIN_NCHANNELS=8 \
-x NCCL_ALGO=Ring \
-x OMP_NUM_THREADS=8 \
-x NCCL_NSOCKS_PERTHREAD=8 \
-x NCCL_SOCKET_NTHREADS=8 \
-bind-to none \
-n 16 -N 8 \
--mca pml ^cm \
--hostfile /job/hostfile \
-mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 \
./build/all_reduce_perf -b 2G -e 2G -g 1 -n 30

10.3.35.83: #   Rank 15 Pid     42 on ip-10-3-60-130 device  7 [0xa0] A100-SXM4-40GB
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83:
10.3.35.83: ip-10-3-35-83:362:362 [0] find_ofi_provider:543 NCCL WARN NET/OFI Couldn't find any optimal provider
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/IB : No device found.
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO NET/Socket : Using [0]eth0:10.3.35.83<0> [1]eth1:10.3.42.59<0> [2]eth2:10.3.48.121<0> [3]eth3:10.3.49.25<0>
10.3.35.83: ip-10-3-35-83:362:362 [0] NCCL INFO Using network Socket
10.3.35.83: NCCL version 2.10.3+cuda11.0
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:371:371 [7] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:369:369 [6] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
10.3.35.83: ip-10-3-35-83:367:367 [5] NCCL INFO NET/OFI Running on P4d platform, Setting NCCL_TOPO_FILE environment variable to /home/deepspeed/aws-ofi-nccl/install/share/aws-ofi-nccl/xml/p4d-24xl-topo.xml
10.3.35.83: ip-10-3-35-83:365:365 [3] NCCL INFO Bootstrap : Using eth0:10.3.35.83<0>

aws branch does not build on centos 7 with gcc 4.8.5

Environment

Operating System: CentOS 7
GCC version: 4.8.5
Host: EC2 g4dn.12xlarge

Commit

bf57c29

Issue

aws branch fails make with the following error:

make[2]: Entering directory `aws_ofi_nccl/source/aws_ofi_nccl/tests'
  CC       nccl_message_transfer.o
  CC       ring.o
  CCLD     nccl_connection
ring.c: In function ‘main’:
ring.c:42:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int recv_n = 0; recv_n < nrecv; recv_n++) {
  ^
ring.c:42:2: note: use option -std=c99 or -std=gnu99 to compile your code
nccl_message_transfer.c: In function ‘main’:
nccl_message_transfer.c:41:2: error: ‘for’ loop initial declarations are only allowed in C99 mode
  for (int recv_n = 0; recv_n < nrecv; recv_n++) {
  ^
nccl_message_transfer.c:41:2: note: use option -std=c99 or -std=gnu99 to compile your code
make[2]: *** [ring.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [nccl_message_transfer.o] Error 1
make[2]: Leaving directory `aws_ofi_nccl/source/aws_ofi_nccl/tests'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `aws_ofi_nccl/source/aws_ofi_nccl'
make: *** [all] Error 2

Reproducer

./autogen.sh
./configure --prefix=${PWD}/install --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
make -j # <--- error
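A hedged workaround, following the compiler's own suggestion (forcing gnu99 mode at configure time; flags otherwise as above):

./configure CFLAGS="-std=gnu99" --prefix=${PWD}/install --with-libfabric=/opt/amazon/efa --with-cuda=/usr/local/cuda --with-mpi=/opt/amazon/openmpi
make -j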

aws-ofi-nccl makes unnecessary calls to ofi_iflush() when using the PSM3 transport.

On some hardware, even a simple tensorflow test can end up calling ofi_iflush() tens of thousands of times per rank. This serves no benefit, since PSM3 ensures that GPU buffers are kept in sync after each I/O. In addition, because ofi_iflush() calls ofi_nccl_gdr_flush_disable() on every invocation, and ofi_nccl_gdr_flush_disable() acquires a mutex on each invocation, this adds further drag on performance.
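A hedged sketch of one way to avoid the per-call mutex: read the parameter once and cache it (pthread_once() here is illustrative; ofi_nccl_gdr_flush_disable() is the plugin function named above, and the remaining names are assumptions):

#include <pthread.h>
#include <stdbool.h>

static bool gdr_flush_disabled;
static pthread_once_t gdr_flush_once = PTHREAD_ONCE_INIT;

static void read_gdr_flush_param(void)
{
        /* acquires the parameter mutex exactly once per process */
        gdr_flush_disabled = ofi_nccl_gdr_flush_disable();
}

/* at the top of ofi_iflush(): */
pthread_once(&gdr_flush_once, read_gdr_flush_param);
if (gdr_flush_disabled)
        return ncclSuccess;  /* skip the flush entirely */

This would not remove the calls themselves, but it makes each one cheap; skipping the flush entirely for providers like PSM3 that keep GPU buffers coherent would need a capability check at init time.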

NCCL WARN NET/OFI Request completed with error. RC: 21. Error: unknown error

Hello aws_ofi_nccl maintainers,

Please let me know if this is not the best location to post the issue and I will close this issue.

I am unable to figure out why the process is hanging after the error message is shown.

My training setup:
2 ml.g4dn.12xlarge instances on AWS SageMaker trying to run distributed training with the PyTorch base image 763104351884.dkr.ecr.us-west-1.amazonaws.com/pytorch-training:1.12.1-gpu-py38-cu113-ubuntu20.04-sagemaker.
The two instances are running inside a private subnet with a NAT gateway attached to the subnet.

All outputs are from host-1
Output of lspci -i efa:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma]
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08)
00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111
00:04.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:05.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:06.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:07.0 Ethernet controller: Amazon.com, Inc. Elastic Fabric Adapter (EFA)
00:08.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:09.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA)
00:1a.0 Non-Volatile memory controller: Amazon.com, Inc. Device 8061
00:1b.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1c.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1d.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1e.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
00:1f.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller

Output of cat /opt/amazon/efa_installed_packages:

# EFA installer version: 1.15.1
# Debug packages installed: no
# Packages installed:
efa-config_1.9_all efa-profile_1.5_all libfabric-aws-bin_1.14.0amzn1.0_amd64 libfabric-aws-dev_1.14.0amzn1.0_amd64 libfabric1-aws_1.14.0amzn1.0_amd64 openmpi40-aws_4.1.2-1_amd64 ibacm_39.0-1_amd64 ibverbs-providers_39.0-1_amd64 ibverbs-utils_39.0-1_amd64 infiniband-diags_39.0-1_amd64 libibmad-dev_39.0-1_amd64 libibmad5_39.0-1_amd64 libibnetdisc-dev_39.0-1_amd64 libibnetdisc5_39.0-1_amd64 libibumad-dev_39.0-1_amd64 libibumad3_39.0-1_amd64 libibverbs-dev_39.0-1_amd64 libibverbs1_39.0-1_amd64 librdmacm-dev_39.0-1_amd64 librdmacm1_39.0-1_amd64 rdma-core_39.0-1_amd64 rdmacm-utils_39.0-1_amd64

Output of /opt/amazon/efa/bin/fi_info -p efa:

provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-rdm
    version: 114.10
    type: FI_EP_RDM
    protocol: FI_PROTO_EFA
provider: efa
    fabric: EFA-fe80::424:a9ff:fed5:b935
    domain: efa_0-dgrm
    version: 114.10
    type: FI_EP_DGRAM
    protocol: FI_PROTO_EFA

Output of training job:
Distributed training is initialized with the NCCL backend in PyTorch using the mmaction2 training library. I set FI_EFA_USE_DEVICE_RDMA=0 because the T4 GPUs do not support RDMA. Also, the cmd is run as an os.system() command in the entrypoint passed to SageMaker.
cmd=

NCCL_SOCKET_IFNAME=eth0 FI_PROVIDER="efa" FI_EFA_USE_DEVICE_RDMA=0 NCCL_DEBUG=INFO FI_LOG_LEVEL=warn FI_LOG_PROV=efa PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python38.zip:/opt/conda/lib/python3.8:/opt/conda/lib/python3.8/lib-dynload:/opt/conda/lib/python3.8/site-packages:/opt/conda/lib/python3.8/site-packages/smdebug-1.0.22b20221214-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument-3.4.2-py3.8.egg:/opt/conda/lib/python3.8/site-packages/pyinstrument_cext-0.2.4-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/flash_attn-0.1-py3.8-linux-x86_64.egg:/opt/conda/lib/python3.8/site-packages/einops-0.6.0-py3.8.egg python -m torch.distributed.launch --nnodes=2 --node_rank=0  --master_addr=algo-1  --nproc_per_node=4  --master_port=7777  <train script> <config.py>
algo-1:462:462 [0] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:462:462 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:462:462 [0] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:462:462 [0] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:462:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:462:462 [0] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:462:462 [0] NCCL INFO NET/OFI Selected Provider is efa
algo-1:462:462 [0] NCCL INFO Using network AWS Libfabric
NCCL version 2.10.3+cuda11.3
algo-1:463:463 [1] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:464:464 [2] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO Bootstrap : Using eth0:10.200.5.135<0>
algo-1:465:465 [3] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:465:465 [3] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:464:464 [2] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:463:463 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v4 symbol.
algo-1:464:464 [2] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:463:463 [1] NCCL INFO NET/OFI Using aws-ofi-nccl 1.3.0aws
algo-1:465:465 [3] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:464:464 [2] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
algo-1:463:463 [1] NCCL INFO NET/OFI Setting FI_EFA_FORK_SAFE environment variable to 1
libfabric:465:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:465:465 [3] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:465:465 [3] NCCL INFO NET/OFI Selected Provider is efa
algo-1:465:465 [3] NCCL INFO Using network AWS Libfabric
libfabric:463:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
libfabric:464:1673041636:efa:core:rxr_info_to_rxr():506<warn> FI_HMEM capability requires RDMA, which this device does not support.
algo-1:463:463 [1] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:463:463 [1] NCCL INFO NET/OFI Selected Provider is efa
algo-1:463:463 [1] NCCL INFO Using network AWS Libfabric
algo-1:464:464 [2] NCCL INFO NET/OFI Forcing AWS OFI ndev 2
algo-1:464:464 [2] NCCL INFO NET/OFI Selected Provider is efa
algo-1:464:464 [2] NCCL INFO Using network AWS Libfabric
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:465:556 [3] NCCL INFO NET/OFI [3] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00
algo-1:464:558 [2] NCCL INFO NET/OFI [2] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:463:557 [1] NCCL INFO NET/OFI [1] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 0 busId 0000:00:1b.0 path /sys/devices/pci0000:00/
algo-1:462:555 [0] NCCL INFO NET/OFI [0] getCudaPath dev 1 busId 0000:00:1c.0 path /sys/devices/pci0000:00
algo-1:463:557 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
algo-1:464:558 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
algo-1:465:556 [3] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
algo-1:462:555 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4   5   6   7
algo-1:462:555 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4   5   6   7
algo-1:462:555 [0] NCCL INFO Trees [0] 1/4/-1->0->-1 [1] 1/-1/-1->0->4
algo-1:462:555 [0] NCCL INFO Channel 00 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 3[1e0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 2[1d0] via direct shared memory
algo-1:465:556 [3] NCCL INFO Channel 00 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:465:556 [3] NCCL INFO Channel 01 : 3[1e0] -> 4[1b0] [send] via NET/AWS Libfabric/0
algo-1:464:558 [2] NCCL INFO Connected all rings
algo-1:462:555 [0] NCCL INFO Channel 01 : 7[1e0] -> 0[1b0] [receive] via NET/AWS Libfabric/0
algo-1:462:555 [0] NCCL INFO Channel 00 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:462:555 [0] NCCL INFO Channel 01 : 0[1b0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Connected all rings
algo-1:464:558 [2] NCCL INFO Channel 00 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:464:558 [2] NCCL INFO Channel 01 : 2[1d0] -> 1[1c0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 00 : 1[1c0] -> 0[1b0] via direct shared memory
algo-1:463:557 [1] NCCL INFO Channel 01 : 1[1c0] -> 0[1b0] via direct shared memory
libfabric:465:1673041637:efa:cq:rxr_cq_write_tx_error():243<warn> rxr_cq_write_tx_error: err: 21, prov_err: Unknown error -21 (21)
algo-1:465:556 [3] ofi_process_cq:1033 NCCL WARN NET/OFI Request 0x7f6390394d18 completed with error. RC: 21. Error: unknown error. Completed length: 0, Request: { buffer_index: 255, dev: 0, size: 0, state: CREATED, direction: SEND }

I see the same error on the algo-2 instance as well.

PyTorch version and environment info output by mmaction2:

2023-01-06 21:47:12,614 - mmaction - INFO - Environment info:
------------------------------------------------------------
sys.platform: linux
Python: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10) [GCC 10.3.0]
CUDA available: True
GPU 0,1,2,3: Tesla T4
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.3, V11.3.109
GCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 1.12.1+cu113
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.3
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.3.2  (built against CUDA 11.5)
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.3, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.13.1+cu113
OpenCV: 4.6.0
MMCV: 1.7.0
MMCV Compiler: GCC 9.3
MMCV CUDA Compiler: 11.3
MMAction2: 0.24.1+
------------------------------------------------------------
2023-01-06 21:47:12,614 - mmaction - INFO - Distributed training: True
