
RCCL

ROCm Communication Collectives Library

Introduction

RCCL (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. There is also initial support for direct GPU-to-GPU send and receive operations. It has been optimized to achieve high bandwidth on platforms using PCIe and xGMI, as well as networking using InfiniBand Verbs or TCP/IP sockets. RCCL supports an arbitrary number of GPUs installed on a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.

Requirements

  1. ROCm supported GPUs
  2. ROCm stack installed on the system (HIP runtime & HIP-Clang)

Quickstart RCCL Build

RCCL directly depends on HIP runtime plus the HIP-Clang compiler, which are part of the ROCm software stack. For ROCm installation instructions, see https://github.com/ROCm/ROCm.

The root of this repository has a helper script, install.sh, to build and install RCCL with a single command. It hard-codes configurations that could otherwise be specified by invoking cmake directly, but it's a great way to get started quickly and can serve as an example of how to build and install RCCL.

To build the library using the install script:

./install.sh

For more info on build options/flags when using the install script, use ./install.sh --help:

./install.sh --help
RCCL build & installation helper script
 Options:
       --address-sanitizer     Build with address sanitizer enabled
    -d|--dependencies          Install RCCL dependencies
       --debug                 Build debug library
       --enable_backtrace      Build with custom backtrace support
       --disable-colltrace     Build without collective trace
       --disable-msccl-kernel  Build without MSCCL kernels
       --disable-mscclpp       Build without MSCCL++ support
    -f|--fast                  Quick-build RCCL (local gpu arch only, no backtrace, and collective trace support)
    -h|--help                  Prints this help message
    -i|--install               Install RCCL library (see --prefix argument below)
    -j|--jobs                  Specify how many parallel compilation jobs to run ($nproc by default)
    -l|--local_gpu_only        Only compile for local GPU architecture
       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, separate by ';' (builds for all supported GPU architectures by default)
       --no_clean              Don't delete files if they already exist
       --npkit-enable          Compile with npkit enabled
       --openmp-test-enable    Enable OpenMP in rccl unit tests
       --roctx-enable          Compile with roctx enabled (example usage: rocprof --roctx-trace ./rccl-program)
    -p|--package_build         Build RCCL package
       --prefix                Specify custom directory to install RCCL to (default: `/opt/rocm`)
       --rm-legacy-include-dir Remove legacy include dir (packaging added for file/folder reorg backward compatibility)
       --run_tests_all         Run all rccl unit tests (must be built already)
    -r|--run_tests_quick       Run small subset of rccl unit tests (must be built already)
       --static                Build RCCL as a static library instead of shared library
    -t|--tests_build           Build rccl unit tests, but do not run
       --time-trace            Plot the build time of RCCL (requires `ninja-build` package installed on the system)
       --verbose               Show compile commands

By default, RCCL builds for all GPU targets defined in DEFAULT_GPUS in CMakeLists.txt. To target specific GPU(s), and potentially reduce build time, use --amdgpu_targets with a semicolon-separated list of the GPU architecture(s) to target.
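
For example, a quick build of the library and unit tests for specific architectures might look like the following (a sketch: the architecture names are illustrative, and the exact flag syntax should be confirmed with ./install.sh --help):

./install.sh -t -j 16 --amdgpu_targets "gfx90a;gfx942"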

Manual build

To build the library using CMake:

$ git clone https://github.com/ROCm/rccl.git
$ cd rccl
$ mkdir build
$ cd build
$ cmake ..
$ make -j 16      # Or some other suitable number of parallel jobs

You may substitute an installation path of your own choosing by passing CMAKE_INSTALL_PREFIX. For example:

$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..

Note: ensure rocm-cmake is installed (e.g., apt install rocm-cmake).

To build and install the RCCL package:

Assuming you have already cloned this repository and built the library as shown in the previous section:

$ cd rccl/build
$ make package
$ sudo dpkg -i *.deb

Installing the RCCL package requires sudo/root access because it creates a directory called "rccl" under /opt/rocm/. This step is optional; RCCL can be used directly by including the path containing librccl.so.

Enabling peer-to-peer transport

To enable peer-to-peer access on machines with PCIe-connected GPUs, the HSA environment variable HSA_FORCE_FINE_GRAIN_PCIE=1 must be set, in addition to having GPUs that support peer-to-peer access and proper large BAR addressing support.
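
For example (a minimal sketch; ./rccl-program stands in for your RCCL application):

export HSA_FORCE_FINE_GRAIN_PCIE=1
./rccl-program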

Tests

RCCL includes unit tests implemented with the GoogleTest framework. The rccl unit tests require GoogleTest 1.10 or higher to build and execute properly (installed with the -d option to install.sh). To invoke the rccl unit tests, go to the build folder, then the test subfolder, and execute the appropriate rccl unit test executable(s).
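
For example, a typical flow might look like this (a sketch assuming the default build directory used by install.sh; the executable name matches the one used in the filtering example below):

./install.sh -d -t      # install dependencies (including GoogleTest) and build the unit tests
cd build/test
./rccl-UnitTests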

rccl unit test names are now of the format:

CollectiveCall.[Type of test]

Filtering of rccl unit tests can be done with environment variables and by passing the --gtest_filter command-line flag, for example:

UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"

will run only AllReduce correctness tests with the bfloat16 datatype. A list of available filtering environment variables appears at the top of every run. See "Running a Subset of the Tests" at https://chromium.googlesource.com/external/github.com/google/googletest/+/HEAD/googletest/docs/advanced.md for more information on how to form more advanced filters.

There are also other performance and error-checking tests for RCCL. These are maintained separately at https://github.com/ROCm/rccl-tests. See the rccl-tests README for more information on how to build and run those tests.

NPKit

RCCL integrates NPKit, a profiler framework that enables collecting fine-grained trace events in RCCL components, especially in giant collective GPU kernels.

Please check the NPKit sample workflow for RCCL as a fully automated usage example. It also provides good templates for the following manual instructions.

To manually build RCCL with NPKit enabled, pass -DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)" to the cmake command. All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix ENABLE_NPKIT_, and they control which information will be collected. Also note that NPKit currently only supports collecting non-overlapped events on the GPU, and -DNPKIT_FLAGS should follow this rule.
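
For example, a minimal configuration might look like the following (only the base ENABLE_NPKIT switch is shown; substitute whichever ENABLE_NPKIT_* event switches you need for the placeholder):

cmake -DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_<event switches>" ..
make -j 16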

To manually run RCCL with NPKit enabled, set the environment variable NPKIT_DUMP_DIR to the NPKit event dump directory. Also note that NPKit currently only supports 1 GPU per process.
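
For example (a sketch; the dump directory and program name are placeholders):

mkdir -p /tmp/npkit_dump
export NPKIT_DUMP_DIR=/tmp/npkit_dump
./rccl-program      # NPKit currently supports one GPU per process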

To manually analyze NPKit dump results, please leverage npkit_trace_generator.py.

MSCCL/MSCCL++

RCCL integrates MSCCL (https://github.com/microsoft/msccl) and MSCCL++ (https://github.com/microsoft/mscclpp) to leverage their highly efficient GPU-GPU communication primitives for collective operations. Thanks to Microsoft Corporation for collaborating with us on this project.

MSCCL uses XMLs for different collective algorithms on different architectures. RCCL collectives can leverage those algorithms once the corresponding XML has been provided by the user. The XML files contain the sequence of send-recv and reduction operations to be executed by the kernel. On MI300X, MSCCL is enabled by default. On other platforms, users may have to enable it by setting RCCL_MSCCL_FORCE_ENABLE=1.
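
For example, to force-enable MSCCL on a non-MI300X platform (a sketch; ./rccl-program is a placeholder, and the corresponding algorithm XML must already be provided as described above):

export RCCL_MSCCL_FORCE_ENABLE=1
./rccl-program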

In addition, the RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. Users need to set the RCCL environment variable RCCL_ENABLE_MSCCLPP=1 to run an RCCL workload with MSCCL++ support. It is also possible to set the message-size threshold for using MSCCL++ via the environment variable RCCL_MSCCLPP_THRESHOLD (the default value is 1MB). Once RCCL_MSCCLPP_THRESHOLD is set, RCCL will invoke MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
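
For example, to enable MSCCL++ and raise the threshold to 2MB (a sketch; the program name is a placeholder, and the threshold value is assumed to be given in bytes, consistent with the 1MB default noted above):

export RCCL_ENABLE_MSCCLPP=1
export RCCL_MSCCLPP_THRESHOLD=2097152
./rccl-program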

Library and API Documentation

Please refer to the RCCL Documentation Site for current documentation.

How to build documentation

Run the steps below to build documentation locally.

cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html

Improving performance on MI300 when using fewer than 8 GPUs

On a system with 8*MI300X GPUs, each pair of GPUs is connected with dedicated xGMI links in a fully connected topology. For collective operations, one can therefore achieve good performance when all 8 GPUs (and all xGMI links) are used. When using fewer than 8 GPUs, one can only achieve a fraction of the potential bandwidth of the system.

If your workload warrants using fewer than 8 MI300 GPUs on a system, you can set the run-time variable NCCL_MIN_NCHANNELS to increase the number of channels.
E.g.: export NCCL_MIN_NCHANNELS=32

Increasing the number of channels can be beneficial to performance, but it also increases GPU utilization for collective operations.

Additionally, we have pre-defined a higher number of channels when using only 2 GPUs or 4 GPUs on an 8*MI300X system: RCCL will use 32 channels for the 2-GPU scenario and 24 channels for the 4-GPU scenario.

Copyright

All source code and accompanying documentation is copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.

All modifications are copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.

rccl's Issues

RCCL with multi IB ports

Env: ROCm 3.9

RCCL may not support two IB ports for one rank, like NCCL_IB_HCA=mlx5_0:1,mlx5_1:1. If I want to implement that, how should I do it? I see "No plugin found (librccl-net.so), using internal implementation". NCCL has plugins such as https://github.com/aws/aws-ofi-nccl; what about RCCL? Also, NCCL needs the nv_peer_memory module from Mellanox for GPUDirect RDMA; what about RCCL? I think this module may not be available for RCCL.

Thanks :)

RCCL sample program crashes with `Memory access fault by GPU node-1` on a 2-GPU machine

OS: Ubuntu 18.04, 16.04. Kernel 4.15
GPUs: 2 x VEGA, or 2 x gfx803
ROCm: 2.2
RCCL: 0.7.5 (built from master and installed to /opt/rocm/rccl)

HIP version: 1.5.19055
HCC clang version 9.0.0 (/data/jenkins_workspace/compute-rocm-rel-2.2/external/hcc-tot/clang c792478f19beee13540053f188094898a008d245) (/data/jenkins_workspace/compute-rocm-rel-2.2/external/hcc-tot/compiler 22192ff6ed5120e2f4ec58a13049025537908ec3) (based on HCC 1.3.19092-1dcecff-c792478-22192ff )
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/rocm/hcc/bin
========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     75.0c  96.0W    1138Mhz 800Mhz  8.0GT/s, x16   32.94%  manual  100.0W   0%        0%       99%      
1     73.0c  98.0W    1138Mhz 800Mhz  8.0GT/s, x16   12.94%  manual  100.0W   0%        0%       100%     
================================================================================================
========================               End of ROCm SMI Log              ========================

The procedure is:

$ /opt/rocm/hip/bin/hipcc -hc -I/opt/rocm/rccl/include allreduce.cpp -L/opt/rocm/lib -lrccl
$ ./a.out
Memory access fault by GPU node-1 (Agent handle: 0x1244a40) on address 0x7f365cc02000. Reason: Page not present or supervisor privilege.
Aborted (core dumped) 

HCC crashes during linking

System configuration:

Debian buster

HCC: df05d4fda454411e5ccb394796bfa3dbc5136918
LLVM: 213053bf7933b050aa1974532929fc3c534ec4ed
HIP: a2d465c5ba946d86f6a59c8716d1604019349cf5

Problem:

build$ make VERBOSE=1
/usr/bin/cmake -S/home/vlad/distr/ROCm/rccl -B/home/vlad/distr/ROCm/rccl/build --check-build-system CMakeFiles/Makefile.cmake 0
/usr/bin/cmake -E cmake_progress_start /home/vlad/distr/ROCm/rccl/build/CMakeFiles /home/vlad/distr/ROCm/rccl/build/CMakeFiles/progress.marks
make -f CMakeFiles/Makefile2 all
make[1]: Entering directory '/home/vlad/distr/ROCm/rccl/build'
make -f CMakeFiles/rccl.dir/build.make CMakeFiles/rccl.dir/depend
make[2]: Entering directory '/home/vlad/distr/ROCm/rccl/build'
cd /home/vlad/distr/ROCm/rccl/build && /usr/bin/cmake -E cmake_depends "Unix Makefiles" /home/vlad/distr/ROCm/rccl /home/vlad/distr/ROCm/rccl /home/vlad/distr/ROCm/rccl/build /home/vlad/distr/ROCm/rccl/build /home/vlad/distr/ROCm/rccl/build/CMakeFiles/rccl.dir/DependInfo.cmake --color=
make[2]: Leaving directory '/home/vlad/distr/ROCm/rccl/build'
make -f CMakeFiles/rccl.dir/build.make CMakeFiles/rccl.dir/build
make[2]: Entering directory '/home/vlad/distr/ROCm/rccl/build'
[  2%] Linking CXX shared library librccl.so
/usr/bin/cmake -E cmake_link_script CMakeFiles/rccl.dir/link.txt --verbose=1
/opt/rocm/hcc/bin/hcc -fPIC   -shared -Wl,-soname,librccl.so.1 -o librccl.so.1.0 CMakeFiles/rccl.dir/src/collectives/device/all_reduce.cpp.o CMakeFiles/rccl.dir/src/collectives/device/all_gather.cpp.o CMakeFiles/rccl.dir/src/collectives/device/reduce.cpp.o CMakeFiles/rccl.dir/src/collectives/device/broadcast.cpp.o CMakeFiles/rccl.dir/src/collectives/device/reduce_scatter.cpp.o CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o CMakeFiles/rccl.dir/src/init.cc.o CMakeFiles/rccl.dir/src/graph/trees.cc.o CMakeFiles/rccl.dir/src/graph/rings.cc.o CMakeFiles/rccl.dir/src/graph/paths.cc.o CMakeFiles/rccl.dir/src/graph/search.cc.o CMakeFiles/rccl.dir/src/graph/connect.cc.o CMakeFiles/rccl.dir/src/graph/tuning.cc.o CMakeFiles/rccl.dir/src/graph/topo.cc.o CMakeFiles/rccl.dir/src/graph/xml.cc.o CMakeFiles/rccl.dir/src/collectives/all_reduce.cc.o CMakeFiles/rccl.dir/src/collectives/all_gather.cc.o CMakeFiles/rccl.dir/src/collectives/reduce.cc.o CMakeFiles/rccl.dir/src/collectives/broadcast.cc.o CMakeFiles/rccl.dir/src/collectives/reduce_scatter.cc.o CMakeFiles/rccl.dir/src/channel.cc.o CMakeFiles/rccl.dir/src/misc/argcheck.cc.o CMakeFiles/rccl.dir/src/misc/nvmlwrap_stub.cc.o CMakeFiles/rccl.dir/src/misc/utils.cc.o CMakeFiles/rccl.dir/src/misc/ibvwrap.cc.o CMakeFiles/rccl.dir/src/transport/coll_net.cc.o CMakeFiles/rccl.dir/src/transport/net.cc.o CMakeFiles/rccl.dir/src/transport/net_ib.cc.o CMakeFiles/rccl.dir/src/transport/net_socket.cc.o CMakeFiles/rccl.dir/src/transport/p2p.cc.o CMakeFiles/rccl.dir/src/transport/shm.cc.o CMakeFiles/rccl.dir/src/transport.cc.o CMakeFiles/rccl.dir/src/debug.cc.o CMakeFiles/rccl.dir/src/group.cc.o CMakeFiles/rccl.dir/src/bootstrap.cc.o CMakeFiles/rccl.dir/src/enqueue.cc.o  -Wl,-rpath,/opt/rocm/lib: --amdgpu-target=gfx803 --amdgpu-target=gfx900 --amdgpu-target=gfx906 --amdgpu-target=gfx908 -hc-function-calls /opt/rocm/lib/libhip_hcc.so.3.5.20173.4505-a2d465c5 -Wl,--enable-new-dtags -hc -L /opt/rocm/hcc/lib -Wl,-rpath /opt/rocm/hcc/lib -Wl,--whole-archive /opt/rocm/hcc/lib/libmcwamp.so.3.1.0 -Wl,--no-whole-archive -ldl -lm /opt/rocm/hcc/lib/libhc_am.so.3.1.0 /opt/rocm/lib/libhsa-runtime64.so -Wl,-rpath-link,/opt/rocm/lib 
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.	Program arguments: /opt/rocm/hcc/bin/llc -mtriple amdgcn-amd-amdhsa -mcpu=gfx803 -mattr=-code-object-v3 -O3 --frame-pointer=none -amdgpu-function-calls=1 -filetype=obj -o /tmp/tmp.2TpBX1aB2X/kernel-gfx803.hsaco.isabin /tmp/tmp.2TpBX1aB2X/kernel-gfx803.hsaco.opt.bc 
1.	Running pass 'CallGraph Pass Manager' on module '/tmp/tmp.2TpBX1aB2X/kernel-gfx803.hsaco.opt.bc'.
2.	Running pass 'Prologue/Epilogue Insertion & Frame Finalization' on function '@_Z23ncclAllReduceRingKernelILi2E7FuncMaxIhEhEvP14CollectiveArgs'
 #0 0x0000560cb0af114a llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/opt/rocm/hcc/bin/llc+0x18b014a)
 #1 0x0000560cb0aef054 llvm::sys::RunSignalHandlers() (/opt/rocm/hcc/bin/llc+0x18ae054)
 #2 0x0000560cb0aef188 SignalHandler(int) (/opt/rocm/hcc/bin/llc+0x18ae188)
 #3 0x00007f93d1378110 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14110)
 #4 0x0000560caf68bad9 llvm::SIFrameLowering::emitPrologue(llvm::MachineFunction&, llvm::MachineBasicBlock&) const (.cold) (/opt/rocm/hcc/bin/llc+0x44aad9)
 #5 0x0000560cb01589f5 (anonymous namespace)::PEI::runOnMachineFunction(llvm::MachineFunction&) (/opt/rocm/hcc/bin/llc+0xf179f5)
 #6 0x0000560cb0054554 llvm::MachineFunctionPass::runOnFunction(llvm::Function&) (/opt/rocm/hcc/bin/llc+0xe13554)
 #7 0x0000560cb03f1b80 llvm::FPPassManager::runOnFunction(llvm::Function&) (/opt/rocm/hcc/bin/llc+0x11b0b80)
 #8 0x0000560cafcbfb5b (anonymous namespace)::CGPassManager::runOnModule(llvm::Module&) (/opt/rocm/hcc/bin/llc+0xa7eb5b)
 #9 0x0000560cb03f3498 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/opt/rocm/hcc/bin/llc+0x11b2498)
#10 0x0000560caf6a7245 main (/opt/rocm/hcc/bin/llc+0x466245)
#11 0x00007f93d0e48e0b __libc_start_main /build/glibc-GwnBeO/glibc-2.30/csu/../csu/libc-start.c:342:3
#12 0x0000560caf703c8a _start (/opt/rocm/hcc/bin/llc+0x4c2c8a)
/opt/rocm/hcc/bin/clamp-device: line 262: 276500 Segmentation fault      $LLC -mtriple amdgcn-amd-amdhsa -mcpu=$AMDGPU_TARGET $CODE_OBJECT_FORMAT $HCC_OPT $KMOPTLLC -amdgpu-function-calls=$AMDGPU_FUNC_CALLS -filetype=obj -o $2.isabin $2.opt.bc
Generating AMD GCN kernel failed in llc for target: gfx803
clang-11: error: linker command failed with exit code 88 (use -v to see invocation)
make[2]: *** [CMakeFiles/rccl.dir/build.make:613: librccl.so.1.0] Error 88
make[2]: Leaving directory '/home/vlad/distr/ROCm/rccl/build'
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/rccl.dir/all] Error 2
make[1]: Leaving directory '/home/vlad/distr/ROCm/rccl/build'
make: *** [Makefile:152: all] Error 2

Rccl error, unhandled system error

OS: centos:7.8.2003
-- HIP version: 4.0.20496-4f163c68
-- HIP_CLANG_PATH: /opt/rocm/llvm/bin
-- miopen version: 2.9.0.8252-26-64506314
-- rocblas version: 2.32.0.2844-cc18d25f
-- hiprand version: 2.10.6.746-26-3932e69
-- rocrand version: 2.10.6.746-26-3932e69
-- rccl version: 2.7.8.492-26-e87f28e
-- rocthrust version: 2.10.6.809-26-9041d6c
-- hipcub version: 2.10.5.207-26-7bda2e4
-- rocprim version: 2.10.5.1102-26-9d47868
-- hipsparse version: 1.9.4.307-26-39bdb97
-- rocsparse version: 1.18.0.863-26-296dfe7
-- rocfft version: 1.0.8.966-26-2d35fd6
-- HIP library name: amdhip64
-- ROCM_HIPRTC_LIB: /opt/rocm/hip/lib/libamdhip64.so
-- Current RCCL header is /opt/rocm-4.0.1/rccl/include/rccl.h. Current RCCL version is v2708.

Running two containers on the same node with the following docker run commands; rocm-smi is able to find 2 GPUs in each container, and the two containers are able to ping each other's IPs and log on via SSH port 22 without a password after configuration.

# node 1
docker run -dit --name liqi-rocm1 -v /home/liqi27/host1:/workspace \
     --shm-size=128G -v /Data:/Data --network=test-net -p 2222:22 \
     --device=/dev/kfd --device=/dev/dri/renderD128 --device=/dev/dri/renderD129 --group-add video \
     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
     qili93/paddle:latest-centos-dev-rocm401 /bin/bash \
     -c "/usr/sbin/sshd -D"

# node2
docker run -dit --name liqi-rocm2 -v /home/liqi27/host2:/workspace \
     --shm-size=128G -v /Data:/Data --network=test-net -p 3333:22 \
     --device=/dev/kfd --device=/dev/dri/renderD130 --device=/dev/dri/renderD131 --group-add video \
     --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
     qili93/paddle:latest-centos-dev-rocm401 /bin/bash \
     -c "/usr/sbin/sshd -D"

Running distributed training with export NCCL_DEBUG=INFO, I found the following error log:

/opt/conda/lib/python3.7/site-packages/paddle/distributed/fleet/base/fleet_base.py:632: UserWarning: It is recommended to use DistributedStrategy in fleet.init(). The strategy here is only for compatibility. If the strategy in fleet.distributed_optimizer() is not None, then it will overwrite the DistributedStrategy in fleet.init(), which will take effect in distributed training.
  "It is recommended to use DistributedStrategy "
W0315 11:13:58.839251   578 device_context.cc:368] Please NOTE: device: 1, GPU Compute Capability: 90.0, Driver API Version: 321.0, Runtime API Version: 3.1
W0315 11:13:58.839298   578 device_context.cc:381] device: 1, MIOpen Version: 2.9.0
I0315 11:13:59.260716   578 gen_comm_id_helper.cc:179] Server listening on: 10.0.1.17:6071 successful.
/opt/conda/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
/opt/conda/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py:89: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(slot[0], (np.ndarray, np.bool, numbers.Number)):
526432f383cd:578:578 [1] NCCL INFO Bootstrap : Using [0]eth0:10.0.1.17<0> [1]eth1:172.17.0.3<0>
526432f383cd:578:578 [1] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation

526432f383cd:578:578 [1] /root/driver/rccl/src/misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
526432f383cd:578:578 [1] NCCL INFO NET/Socket : Using [0]eth0:10.0.1.17<0> [1]eth1:172.17.0.3<0>
526432f383cd:578:578 [1] NCCL INFO Using network Socket
526432f383cd:578:659 [1] NCCL INFO RCCL AllToAll(v)/Scatter/Gather kernels enabled
526432f383cd:578:659 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
526432f383cd:578:659 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1 [2] -1/-1/-1->1->0|0->1->-1/-1/-1 [3] -1/-1/-1->1->0|0->1->-1/-1/-1
526432f383cd:578:659 [1] NCCL INFO Setting affinity for GPU 1 to ff00
526432f383cd:578:659 [1] NCCL INFO Channel 00 : 1[26000] -> 2[43000] [send] via NET/Socket/0/MEM1
526432f383cd:578:659 [1] NCCL INFO Channel 00 : 1[26000] -> 0[4000] via direct shared memory
526432f383cd:578:659 [1] NCCL INFO Channel 01 : 1[26000] -> 2[43000] [send] via NET/Socket/1/MEM1
526432f383cd:578:659 [1] NCCL INFO Call to connect returned Connection timed out, retrying
526432f383cd:578:659 [1] NCCL INFO Call to connect returned Connection timed out, retrying

526432f383cd:578:659 [1] /root/driver/rccl/src/include/socket.h:408 NCCL WARN Connect to 172.17.0.4<46303> failed : Connection timed out
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/transport/net_socket.cc:314 -> 2
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/include/net.h:22 -> 2
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/transport/net.cc:175 -> 2
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/transport.cc:69 -> 2
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/init.cc:922 -> 2
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/init.cc:1038 -> 2
526432f383cd:578:659 [1] NCCL INFO /root/driver/rccl/src/group.cc:74 -> 2 [Async thread]
Traceback (most recent call last):
  File "tools/static/train.py", line 191, in <module>
    main(args)
  File "tools/static/train.py", line 163, in main
    lr_scheduler)
  File "/workspace/PaddleClas/tools/static/program.py", line 538, in run
    fetch_list=fetch_list)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1110, in run
    six.reraise(*sys.exc_info())
  File "/opt/conda/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1108, in run
    return_merged=return_merged)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1217, in _run_impl
    program._graph._compile(scope, self.place)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 465, in _compile
    use_device=use_device, scope=self._scope, places=self._places)
  File "/opt/conda/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 415, in _compile_data_parallel
    self._exec_strategy, self._build_strategy, self._graph)
OSError: (External)  Rccl error, unhandled system error  (at /workspace/Github-qili93/Paddle/paddle/fluid/platform/nccl_helper.h:77)

The code around nccl_helper.h:77 is here:

https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/platform/nccl_helper.h#L77

Build failure - need dependencies.

Kindly specify the prerequisites/dependencies for this package.

I get the message below (indicating some missing dependencies):

CMake Error at CMakeLists.txt:33 (find_package):
  Could not find a package configuration file provided by "ROCM" with any of
  the following names:

    ROCMConfig.cmake
    rocm-config.cmake

  Add the installation prefix of "ROCM" to CMAKE_PREFIX_PATH or set
  "ROCM_DIR" to a directory containing one of the above files.  If "ROCM"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
See also "/home/rocm/rccl_build/CMakeFiles/CMakeOutput.log".

RCCL plugin uses UCX

When the RCCL plugin uses UCX, IPC creation fails. Refer to the previous issue: #480. What is the specific reason?

j17r2n14:15524:15581 [0] NCCL INFO Connected all rings comm 0x2aaf44000c00 nRanks 08 busId 4000
j17r2n13:31976:32049 [0] NCCL INFO Connected all rings comm 0x2b6520000c00 nRanks 08 busId 4000
j17r2n14:15524:15581 [0] NCCL INFO Channel 00 : 0[4000] -> 4[4000] [receive] via NET/UCX/0/GDRDMA comm 0x2aaf44000c00 nRanks 08
j17r2n13:31976:32049 [0] NCCL INFO Channel 00 : 4[4000] -> 0[4000] [receive] via NET/UCX/0/GDRDMA comm 0x2b6520000c00 nRanks 08
j17r2n14:15524:15581 [0] NCCL INFO Channel 01 : 0[4000] -> 4[4000] [receive] via NET/UCX/0/GDRDMA comm 0x2aaf44000c00 nRanks 08
j17r2n13:31976:32049 [0] NCCL INFO Channel 01 : 4[4000] -> 0[4000] [receive] via NET/UCX/0/GDRDMA comm 0x2b6520000c00 nRanks 08
j17r2n14:15524:15581 [0] NCCL INFO Channel 00 : 4[4000] -> 0[4000] [send] via NET/UCX/0 comm 0x2aaf44000c00 nRanks 08
j17r2n13:31976:32049 [0] NCCL INFO Channel 00 : 0[4000] -> 4[4000] [send] via NET/UCX/0 comm 0x2b6520000c00 nRanks 08
j17r2n14:15524:15581 [0] NCCL INFO Channel 01 : 4[4000] -> 0[4000] [send] via NET/UCX/0 comm 0x2aaf44000c00 nRanks 08
j17r2n13:31976:32049 [0] NCCL INFO Channel 01 : 0[4000] -> 4[4000] [send] via NET/UCX/0 comm 0x2b6520000c00 nRanks 08
===ipc/rocm_ipc_md.c 71 error info: status:4097 base_ptr:0x2b651b200000 size:6422528 key->ipc:0x2b6520b4a260===
[1645148934.092347] [j17r2n13:31976:0]     rocm_ipc_md.c:72   UCX  ERROR Failed to create ipc for 0x2b651b200000/620000
===ipc/rocm_ipc_md.c 71 error info: status:4097 base_ptr:0x2aaf3f000000 size:6422528 key->ipc:0x2aaf4477f820===
[1645148934.092526] [j17r2n14:15524:0]     rocm_ipc_md.c:72   UCX  ERROR Failed to create ipc for 0x2aaf3f000000/620000
===ipc/rocm_ipc_md.c 71 error info: status:4097 base_ptr:0x2b653a000000 size:6422528 key->ipc:0x2b652069df90===
[1645148934.093308] [j17r2n13:31976:0]     rocm_ipc_md.c:72   UCX  ERROR Failed to create ipc for 0x2b653a000000/620000
===ipc/rocm_ipc_md.c 71 error info: status:4097 base_ptr:0x2aaf62800000 size:6422528 key->ipc:0x2aaf4477f5f0===
[1645148934.093589] [j17r2n14:15524:0]     rocm_ipc_md.c:72   UCX  ERROR Failed to create ipc for 0x2aaf62800000/620000
===ipc/rocm_ipc_md.c 75 success info: status:0 base_ptr:0x2b651a800000 size:8388608 key->ipc:0x2b6520b4ad80===
===ipc/rocm_ipc_md.c 75 success info: status:0 base_ptr:0x2aaf3e600000 size:8388608 key->ipc:0x2aaf44ba8b30===
===ipc/rocm_ipc_md.c 75 success info: status:0 base_ptr:0x2b6539600000 size:8388608 key->ipc:0x2b6520b9d650===
===ipc/rocm_ipc_md.c 75 success info: status:0 base_ptr:0x2aaf61e00000 size:8388608 key->ipc:0x2aaf44b9df90===

GPU cluster with distributed memory

Hello, dear colleagues!

I have 2 nodes with distributed memory. GPUs 0,1,2,3 are installed on Node 0 and are connected by Infinity Fabric. Node 1 has 4 more GPUs installed, which are also interconnected by Infinity Fabric. Thus, nodes 0 and 1 form a distributed-memory system, but within a single node the GPUs can communicate via Infinity Fabric.

  • If I use a collective P2P for N0:GPU X -> N1:GPU Y, will the exchange be done via sockets?
  • If I use a collective AllGather, will Infinity Fabric be used within GPUs 0,1,2,3 on Node 0 (and similarly within the GPUs on Node 1), or will all transfers (N0:GPUx1 -> N0:GPUx2 and N0:GPUy1 -> N1:GPUy2) be done via sockets?

Thanks anyway!

I can't run this demo.

  • CentOS 7.6
  • ROCm2.9
  • RCCL(NCCL) version 2.4.8

The demo code is as follows:

#include "hip/hip_runtime.h"
#include <stdio.h>
#include <malloc.h>
#include <stdlib.h>
#include "rccl.h"

#define HIP_CHECK(command){  \
	hipError_t status = command; \
	if(status != hipSuccess){ \
		fprintf(stderr,"Error: HIP reports:%s\n", hipGetErrorString(status)); \
	} \
}

#define NCCLCHECK(cmd) do{   \
    ncclResult_t r=cmd;    \
    if(r!=ncclSuccess){     \
        printf("Failed, NCCL error %s:%d '%s'\n",__FILE__,__LINE__,ncclGetErrorString(r));   \
        exit(1);    \
    }                        \
} while(0)

int main(int argc, char** argv) {
    printf("begin run main...\n");

    ncclComm_t comms[1];

    //manageing 1 devices
    int nDev = 2;
    int size = 32;
    int devs[2] = { 0,1 };

    //allocating and initializing device buffers
    float** sendbuff = (float**)malloc(nDev * sizeof(float*));
    float** recvbuff = (float**)malloc(nDev * sizeof(float*));

    hipStream_t* s = (hipStream_t*)malloc(nDev * sizeof(hipStream_t));

    for (int i = 0;i < nDev;i++) {
        printf("cudaSetdevice..\n");
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipMalloc(sendbuff + i, size * sizeof(float)));
        HIP_CHECK(hipMalloc(recvbuff + i, size * sizeof(float)));
        HIP_CHECK(hipMemset(sendbuff[i], 1, size * sizeof(float)));
        HIP_CHECK(hipMemset(recvbuff[i], 0, size * sizeof(float)));
        float* h_arr;
        h_arr = (float*)malloc(size * sizeof(float));
        for (int i = 0; i < size; ++i)
            h_arr[i] = i; // Or other values
        HIP_CHECK(hipMemcpy(sendbuff[i], h_arr, size * sizeof(float), hipMemcpyHostToDevice));
        //CUDACHECK(hipStreamCreate(s+i));
    }
    //initilzing NCCL
    NCCLCHECK(ncclCommInitAll(comms, nDev, devs));
    //calling NCCL communication API. Group API is required when using
    //multiple devices per thread
    NCCLCHECK(ncclGroupStart());

    for (int i = 0; i < nDev; ++i)
        //NCCLCHECK(ncclAllReduce((const void*)sendbuff[i], (void*)recvbuff[i], size, ncclFloat, ncclSum, comms[i], s[i]));
        NCCLCHECK(ncclReduce((const void *)sendbuff[i], (void*)recvbuff[i],size,ncclFloat, ncclSum, 0,comms[i], s[i]));


    NCCLCHECK(ncclGroupEnd());
    printf("\n ncclGroupEnd..\n");
    //synchronizing on CUDA streams to wait for completion of NCCL operation
    for (int i = 0; i < nDev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipStreamSynchronize(s[i]));
    }
    for (int i = 0; i < nDev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        float* recvCPU = (float*)malloc(size * sizeof(float));
        HIP_CHECK(hipMemcpy(recvCPU, recvbuff[i], sizeof(float) * size, hipMemcpyDeviceToHost));
        printf("End Reduce Dev is %d of process, RecvBUf is %f,%f,%f,%f\n", i, recvCPU[0], recvCPU[1], recvCPU[2], recvCPU[3]);

    }
    //free device buffers
    for (int i = 0; i < nDev; ++i) {
        HIP_CHECK(hipSetDevice(i));
        HIP_CHECK(hipFree(sendbuff[i]));
        HIP_CHECK(hipFree(recvbuff[i]));
    }

    //finalizing NCCL
    for (int i = 0; i < nDev; ++i)
        ncclCommDestroy(comms[i]);

    printf("Success \n");
    return 0;
}

Compile and run as follows:

hipcc -l rccl demo.cpp
NCCL_DEBUG=INFO ./a.out

The screen output is as follows:

begin run main...
cudaSetdevice..
cudaSetdevice..
a03r2n12:45142:45142 [1] NCCL INFO Bootstrap : Using [0]ib0:11.1.3.33<0>
a03r2n12:45142:45142 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
a03r2n12:45142:45142 [1] NCCL INFO NET/IB : Using [0]mlx5_3:1/IB [1]mlx5_2:1/IB [2]mlx5_1:1/IB [3]mlx5_0:1/IB ; OOB ib0:11.1.3.33<0>
NCCL version 2.4.8+hip
a03r2n12:45142:45142 [1] NCCL INFO nranks 2
a03r2n12:45142:45142 [0] NCCL INFO Setting affinity for GPU 0 to 01
a03r2n12:45142:45142 [1] NCCL INFO 0 -> 1: link type QPI hops 2
a03r2n12:45142:45142 [1] NCCL INFO 1 -> 0: link type QPI hops 2
a03r2n12:45142:45142 [0] NCCL INFO Duplicating rings to 2 per user request.
a03r2n12:45142:45142 [1] NCCL INFO Using 256 threads, Min Comp Cap 3, Trees disabled
a03r2n12:45142:45142 [1] NCCL INFO Channel 00 :    0   1
a03r2n12:45142:45142 [1] NCCL INFO Channel 01 :    0   1
a03r2n12:45142:45142 [0] NCCL INFO Setting affinity for GPU 0 to 01
a03r2n12:45142:45142 [0] NCCL INFO Setting affinity for GPU 0 to 01
a03r2n12:45142:45142 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
a03r2n12:45142:45142 [1] NCCL INFO Setting affinity for GPU 1 to 0100
a03r2n12:45142:45142 [1] NCCL INFO Setting affinity for GPU 1 to 0100
a03r2n12:45142:45142 [1] NCCL INFO Ring 00 : 1[1] -> 0[0] via direct shared memory
a03r2n12:45142:45142 [0] NCCL INFO Setting affinity for GPU 0 to 01
a03r2n12:45142:45142 [0] NCCL INFO Setting affinity for GPU 0 to 01
a03r2n12:45142:45142 [0] NCCL INFO Ring 01 : 0[0] -> 1[1] via direct shared memory
a03r2n12:45142:45142 [1] NCCL INFO Setting affinity for GPU 1 to 0100
a03r2n12:45142:45142 [1] NCCL INFO Setting affinity for GPU 1 to 0100
a03r2n12:45142:45142 [1] NCCL INFO Ring 01 : 1[1] -> 0[0] via direct shared memory
terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
Aborted

When I run it a third time, there are no errors, but the program hangs. I tried ROCm 3.3, and it still has this problem. Changing ncclReduce to other communication functions has no effect.

Performance log of 4 x VEGAs

As requested in #59 (comment)

x299 OC Formula
i7 7800X(28 PCI lanes)
4 x VEGAs
Ubuntu 18.04(Kernel 4.15)

(base) syoyo@wwk:~/work/rccl-tests$ LD_LIBRARY_PATH=/opt/rocm/rccl/lib HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_gather_perf -g 4
# nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   6818 on        wwk device  0 [0x0c] Vega 10 XT [Radeon RX Vega 64]
#   Rank  1 Pid   6818 on        wwk device  1 [0x19] Vega 10 XT [Radeon RX Vega 64]
#   Rank  2 Pid   6818 on        wwk device  2 [0x67] Vega 10 XT [Radeon RX Vega 64]
#   Rank  3 Pid   6818 on        wwk device  3 [0x6a] Vega 10 XT [Radeon RX Vega 64]
#
#                                             out-of-place                       in-place          
#       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       2097152   float    13058    1.93    1.93  0e+00    12866    1.96    1.96  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.94158 
#
(base) syoyo@wwk:~/work/rccl-tests$ LD_LIBRARY_PATH=/opt/rocm/rccl/lib ./build/all_gather_perf -g 4
# nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   6823 on        wwk device  0 [0x0c] Vega 10 XT [Radeon RX Vega 64]
#   Rank  1 Pid   6823 on        wwk device  1 [0x19] Vega 10 XT [Radeon RX Vega 64]
#   Rank  2 Pid   6823 on        wwk device  2 [0x67] Vega 10 XT [Radeon RX Vega 64]
#   Rank  3 Pid   6823 on        wwk device  3 [0x6a] Vega 10 XT [Radeon RX Vega 64]
#
#                                             out-of-place                       in-place          
#       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
    33554432       2097152   float    13047    1.93    1.93  0e+00    12888    1.95    1.95  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.94071 
#
(base) syoyo@wwk:~/work/rccl-tests$ NCCL_DEBUG=INFO LD_LIBRARY_PATH=/opt/rocm/rccl/lib HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_gather_perf -g 4
# nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   6828 on        wwk device  0 [0x0c] Vega 10 XT [Radeon RX Vega 64]
#   Rank  1 Pid   6828 on        wwk device  1 [0x19] Vega 10 XT [Radeon RX Vega 64]
#   Rank  2 Pid   6828 on        wwk device  2 [0x67] Vega 10 XT [Radeon RX Vega 64]
#   Rank  3 Pid   6828 on        wwk device  3 [0x6a] Vega 10 XT [Radeon RX Vega 64]
wwk:6828:6828 [0] NCCL INFO NET : Using interface eno1:192.168.100.25<0>
wwk:6828:6828 [0] NCCL INFO NET/Socket : 1 interfaces found
wwk:6828:6828 [0] NCCL INFO No network plugin found.

wwk:6828:6828 [0] /home/syoyo/work/rccl/build/release/src/misc/ibvwrap.cpp:63 NCCL WARN Failed to open libibverbs.so[.1]
wwk:6828:6828 [0] NCCL INFO Using network Socket
wwk:6828:6828 [0] NCCL INFO HSA_FORCE_FINE_GRAIN_PCIE set by environment to 1.
NCCL version 2.3.7+hip
wwk:6828:6828 [3] NCCL INFO nranks 4
wwk:6828:6828 [0] NCCL INFO comm 0x26f5bf0 rank 0 nranks 4
wwk:6828:6828 [1] NCCL INFO comm 0x26f7cf0 rank 1 nranks 4
wwk:6828:6828 [2] NCCL INFO comm 0x2781400 rank 2 nranks 4
wwk:6828:6828 [3] NCCL INFO comm 0x27851a0 rank 3 nranks 4
wwk:6828:6828 [0] NCCL INFO CUDA Dev 0, Socket NIC distance :  PHB
wwk:6828:6828 [1] NCCL INFO CUDA Dev 1, Socket NIC distance :  SOC
wwk:6828:6828 [2] NCCL INFO CUDA Dev 2, Socket NIC distance :  SOC
wwk:6828:6828 [3] NCCL INFO CUDA Dev 3, Socket NIC distance :  SOC
wwk:6828:6828 [3] NCCL INFO Using 256 threads
wwk:6828:6828 [3] NCCL INFO Min Comp Cap 3
wwk:6828:6828 [3] NCCL INFO Ring 00 :    0   1   2   3
wwk:6828:6828 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
wwk:6828:6828 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
wwk:6828:6828 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
wwk:6828:6828 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
#
#                                             out-of-place                       in-place          
#       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
wwk:6828:6828 [0] NCCL INFO Launch mode Group/Stream
    33554432       2097152   float    13044    1.93    1.93  0e+00    12897    1.95    1.95  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.94029 
#
(base) syoyo@wwk:~/work/rccl-tests$ NCCL_DEBUG=INFO LD_LIBRARY_PATH=/opt/rocm/rccl/lib ./build/all_gather_perf -g 4
# nThread 1 nGpus 4 minBytes 33554432 maxBytes 33554432 step: 1048576(bytes) warmup iters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   6833 on        wwk device  0 [0x0c] Vega 10 XT [Radeon RX Vega 64]
#   Rank  1 Pid   6833 on        wwk device  1 [0x19] Vega 10 XT [Radeon RX Vega 64]
#   Rank  2 Pid   6833 on        wwk device  2 [0x67] Vega 10 XT [Radeon RX Vega 64]
#   Rank  3 Pid   6833 on        wwk device  3 [0x6a] Vega 10 XT [Radeon RX Vega 64]
wwk:6833:6833 [0] NCCL INFO NET : Using interface eno1:192.168.100.25<0>
wwk:6833:6833 [0] NCCL INFO NET/Socket : 1 interfaces found
wwk:6833:6833 [0] NCCL INFO No network plugin found.

wwk:6833:6833 [0] /home/syoyo/work/rccl/build/release/src/misc/ibvwrap.cpp:63 NCCL WARN Failed to open libibverbs.so[.1]
wwk:6833:6833 [0] NCCL INFO Using network Socket
NCCL version 2.3.7+hip
wwk:6833:6833 [3] NCCL INFO nranks 4
wwk:6833:6833 [0] NCCL INFO comm 0x1ac0cf0 rank 0 nranks 4
wwk:6833:6833 [1] NCCL INFO comm 0x1ac2df0 rank 1 nranks 4
wwk:6833:6833 [2] NCCL INFO comm 0x1b4d450 rank 2 nranks 4
wwk:6833:6833 [3] NCCL INFO comm 0x1b4f550 rank 3 nranks 4
wwk:6833:6833 [0] NCCL INFO CUDA Dev 0, Socket NIC distance :  PHB
wwk:6833:6833 [1] NCCL INFO CUDA Dev 1, Socket NIC distance :  SOC
wwk:6833:6833 [2] NCCL INFO CUDA Dev 2, Socket NIC distance :  SOC
wwk:6833:6833 [3] NCCL INFO CUDA Dev 3, Socket NIC distance :  SOC
wwk:6833:6833 [3] NCCL INFO Using 256 threads
wwk:6833:6833 [3] NCCL INFO Min Comp Cap 3
wwk:6833:6833 [3] NCCL INFO Ring 00 :    0   1   2   3
wwk:6833:6833 [0] NCCL INFO Ring 00 : 0[0] -> 1[1] via direct shared memory
wwk:6833:6833 [1] NCCL INFO Ring 00 : 1[1] -> 2[2] via direct shared memory
wwk:6833:6833 [2] NCCL INFO Ring 00 : 2[2] -> 3[3] via direct shared memory
wwk:6833:6833 [3] NCCL INFO Ring 00 : 3[3] -> 0[0] via direct shared memory
#
#                                             out-of-place                       in-place          
#       size         count    type     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)             (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
wwk:6833:6833 [0] NCCL INFO Launch mode Group/Stream
    33554432       2097152   float    13043    1.93    1.93  0e+00    12897    1.95    1.95  0e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.94039 
#

Radeon RX 5700 XT support

Does current RCCL support AMD Radeon RX 5700 XT? I cannot find a list of supported GPUs.

With the rccl 2.7.6 release, it ends with the following error:

$ ~/rccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
/src/external/hip-on-vdi/rocclr/hip_code_object.cpp:92: guarantee(false && "hipErrorNoBinaryForGpu: Coudn't find binary for current devices!")

With the develop branch (2ecfc62), the results are wrong at some sizes:

$ ~/rccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThread: 1 nGpus: 2 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid 799159 on      tower device  0 [0x3b] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
#   Rank  1 Pid 799159 on      tower device  1 [0x86] Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
#
#                                                     out-of-place                       in-place          
#       size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2   float     sum    38.88    0.00    0.00  0e+00    31.40    0.00    0.00  0e+00
          16             4   float     sum    32.05    0.00    0.00  0e+00    30.87    0.00    0.00  0e+00
          32             8   float     sum    31.24    0.00    0.00  0e+00    31.00    0.00    0.00  0e+00
          64            16   float     sum    30.82    0.00    0.00  0e+00    31.73    0.00    0.00  0e+00
         128            32   float     sum    32.39    0.00    0.00  0e+00    32.09    0.00    0.00  0e+00
         256            64   float     sum    30.79    0.01    0.01  0e+00    29.66    0.01    0.01  0e+00
         512           128   float     sum    30.41    0.02    0.02  0e+00    27.51    0.02    0.02  0e+00
        1024           256   float     sum    27.68    0.04    0.04  0e+00    27.67    0.04    0.04  0e+00
        2048           512   float     sum    29.45    0.07    0.07  0e+00    28.51    0.07    0.07  0e+00
        4096          1024   float     sum    31.27    0.13    0.13  0e+00    29.28    0.14    0.14  0e+00
        8192          2048   float     sum    30.20    0.27    0.27  0e+00    30.72    0.27    0.27  0e+00
       16384          4096   float     sum    33.57    0.49    0.49  0e+00    34.39    0.48    0.48  0e+00
       32768          8192   float     sum    41.27    0.79    0.79  0e+00    41.88    0.78    0.78  0e+00
       65536         16384   float     sum    58.02    1.13    1.13  0e+00    56.33    1.16    1.16  0e+00
      131072         32768   float     sum    88.44    1.48    1.48  0e+00    88.37    1.48    1.48  0e+00
      262144         65536   float     sum    70.41    3.72    3.72  0e+00    68.20    3.84    3.84    inf
      524288        131072   float     sum    97.85    5.36    5.36  0e+00    97.26    5.39    5.39  0e+00
     1048576        262144   float     sum    154.5    6.79    6.79  0e+00    154.7    6.78    6.78  0e+00
     2097152        524288   float     sum    292.4    7.17    7.17  0e+00    291.1    7.20    7.20  0e+00
     4194304       1048576   float     sum    600.1    6.99    6.99  0e+00    619.1    6.77    6.77  0e+00
     8388608       2097152   float     sum   1143.3    7.34    7.34  0e+00   1145.9    7.32    7.32  0e+00
    16777216       4194304   float     sum   2136.8    7.85    7.85  3e+38   2134.4    7.86    7.86  0e+00
    33554432       8388608   float     sum   3952.4    8.49    8.49  0e+00   3938.7    8.52    8.52  0e+00
    67108864      16777216   float     sum   7674.6    8.74    8.74  0e+00   7670.7    8.75    8.75  0e+00
   134217728      33554432   float     sum    14972    8.96    8.96  0e+00    15055    8.92    8.92  0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 2 FAILED
# Avg bus bandwidth    : 3.0333 

With MPI-enabled rccl-tests, it ends with the following error:

$ mpirun -np 2 -H tower,palace -x HSA_FORCE_FINE_GRAIN_PCIE=1 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,GRAPH --bind-to none --mca btl openib,self --mca btl_openib_allow_ib true ~/rccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
(... some outputs ...)
tower: Test NCCL failure common.cu:937 'internal error'
palace: Test NCCL failure common.cu:937 'internal error'

How to tune Broadcast performance?

Hi!
I use Broadcast. I measured the Broadcast collective call and saw that its performance with HSA_FORCE_FINE_GRAIN_PCIE=1 is at most 1.7 GB/sec, while Broadcast without HSA_FORCE_FINE_GRAIN_PCIE is > 14 GB/sec. That is one order of magnitude!

  • I work on a GPU cluster with 4 AI100 GPUs connected via PCIe (#538).
  • RCCL was built via CXX=/opt/rocm/bin/hipcc cmake -DTRACE -DPROFILE 1 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=<my path>.

Is there a way to understand why the performance of Broadcast with P2P enabled (via HSA_FORCE_FINE_GRAIN_PCIE=1) is worse than with HSA_FORCE_FINE_GRAIN_PCIE turned off?

Thanks!

CMake problems

System configuration:

Debian buster

HCC: df05d4fda454411e5ccb394796bfa3dbc5136918
LLVM: 213053bf7933b050aa1974532929fc3c534ec4ed
HIP: a2d465c5ba946d86f6a59c8716d1604019349cf5

Problem:

build$ CXX=/opt/rocm/hcc/bin/hcc cmake ..
-- The CXX compiler identification is Clang 11.0.0
-- Check for working CXX compiler: /opt/rocm/hcc/bin/hcc
-- Check for working CXX compiler: /opt/rocm/hcc/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.29") 
-- HIP compiler: hcc
-- HIP runtime: HCC
-- Configuring done
CMake Error in CMakeLists.txt:
  Imported target "hip::device" includes non-existent path

    "/opt/rocm/../include"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
  provide.



CMake Error in CMakeLists.txt:
  Imported target "hip::device" includes non-existent path

    "/opt/rocm/../include"

  in its INTERFACE_INCLUDE_DIRECTORIES.  Possible reasons include:

  * The path was deleted, renamed, or moved to another location.

  * An install or uninstall procedure did not complete successfully.

  * The installation package was faulty and references files it does not
  provide.



-- Generating done
CMake Generate step failed.  Build files cannot be regenerated correctly.

TF 2.6's mirrored model does not distribute across two GPUs

With two MI GPUs, the mirrored model did not distribute across the available GPUs (2 in this case); rocm-smi --showpids shows only one GPU in use. The same code runs on both GPUs if they are NVIDIA:

MI (load docker + tf)

alias drun='sudo docker run -it --network=host --device=/dev/kfd --device=/dev/dri --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -v $HOME/dockerx:/dockerx'
drun rocm/tensorflow:latest

LOG:


2.6.0
2.6.0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
40960/29515 [=========================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26435584/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
16384/5148 [===============================================================================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4431872/4422102 [==============================] - 0s 0us/step
X_train_full.shape:  (60000, 28, 28)
X_train_full.dtype:  uint8
2021-11-07 23:11:19.918141: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-11-07 23:11:19.928023: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 31740 MB memory:  -> device: 0, name: Device 738c, pci bus id: 0000:c3:00.0
2021-11-07 23:11:20.708667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 31740 MB memory:  -> device: 1, name: Device 738c, pci bus id: 0000:a3:00.0
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
flatten (Flatten)            (None, 784)               0
_________________________________________________________________
dense (Dense)                (None, 300)               235500
_________________________________________________________________
dense_1 (Dense)              (None, 100)               30100
_________________________________________________________________
dense_2 (Dense)              (None, 30)                3030
=================================================================
Total params: 268,630
Trainable params: 268,630
Non-trainable params: 0
_________________________________________________________________
model summary:  None
2021-11-07 23:11:22.146723: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:695] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_9"
op: "FlatMapDataset"
input: "PrefetchDataset/_8"
attr {
  key: "Targuments"
  value {
    list {
    }
  }
}
attr {
  key: "f"
  value {
    func {
      name: "__inference_Dataset_flat_map_slice_batch_indices_234"
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: -1
        }
      }
    }
  }
}
attr {
  key: "output_types"
  value {
    list {
      type: DT_INT64
    }
  }
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto_shard_policy = AutoShardPolicy.DATA` before applying the options object to the dataset via `dataset.with_options(options)`.
2021-11-07 23:11:22.173320: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
2021-11-07 23:11:22.209968: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.217416: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.223131: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.226946: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.351371: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.354764: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.358438: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
2021-11-07 23:11:22.361702: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
Epoch 1/10
2021-11-07 23:11:24.962672: I tensorflow/core/common_runtime/gpu_fusion_pass.cc:507] ROCm Fusion is enabled.
1171/1719 [===================>..........] - ETA: 2s - loss: 0.8513 - accuracy: 0.7319^CTraceback (most recent call last):
  File "p710.py", line 36, in <module>
    history=mirrored_model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1184, in fit
    tmp_logs = self.train_function(iterator)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 885, in __call__
    result = self._call(*args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py", line 917, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 3040, in __call__
    filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 1964, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py", line 596, in call
    ctx=ctx)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
KeyboardInterrupt


Every 1.0s: rocm-smi --showpids                                                          gb-sjc2-28: Sun Nov  7 23:12:14 2021



======================= ROCm System Management Interface =======================
================================ KFD Processes =================================
KFD process information:
PID     PROCESS NAME    GPU(s)  VRAM USED       SDMA USED       CU OCCUPANCY
5113    UNKNOWN         2       66634948608     0               0
================================================================================
============================= End of ROCm SMI Log ==============================

CODE:

# Using neural net to do a classification task.

import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

from tensorflow import keras
print(tf.__version__)
print(keras.__version__)

CONFIG_ENABLE_PLOT=0

fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()
print("X_train_full.shape: ", X_train_full.shape)
print("X_train_full.dtype: ", X_train_full.dtype)

X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:]/255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.0
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat", "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

distribution = tf.distribute.MirroredStrategy()

with distribution.scope():
    mirrored_model=keras.models.Sequential()
    mirrored_model.add(keras.layers.Flatten(input_shape = [28, 28]))
    mirrored_model.add(keras.layers.Dense(300, activation="relu"))
    mirrored_model.add(keras.layers.Dense(100, activation="relu"))
    mirrored_model.add(keras.layers.Dense(30, activation="softmax"))

print("model summary: ", mirrored_model.summary())

mirrored_model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd", metrics=["accuracy"])
batch_size=100
history=mirrored_model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))

pd.DataFrame(history.history).plot(figsize=(8, 5))

if CONFIG_ENABLE_PLOT:
    plt.grid(True)
    plt.gca().set_ylim(0, 1)
    plt.show()

mirrored_model.evaluate(X_test, y_test)

print("mirrored_model layers: ", mirrored_model.layers)
weights, biases  = mirrored_model.layers[1].get_weights()
print("weights, biases (shapes): ", weights, biases, weights.shape, biases.shape)
mirrored_model.save("p297.h5")
X_new = X_test[:3]
y_proba = mirrored_model.predict(X_new)
print(y_proba.round(2))

y_pred = mirrored_model.predict_classes(X_new)
print("y_pred: ", y_pred)


NVIDIA RUN (RTX2070 + GTX1080)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| 41%   31C    P2    41W / 215W |   7098MiB /  7981MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
|  0%   23C    P2    29W / 151W |   7697MiB /  8119MiB |     29%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1933      C   python3                          7095MiB |
|    1   N/A  N/A      1933      C   python3                          7693MiB |
+-----------------------------------------------------------------------------+


Encountered build failure in ~/rccl/tests/validation/kernels

rocm@prj47-rack-40:~/rccl/tests/validation/kernels$ make
mkdir -p bin
/opt/rocm/bin/hipcc -I../ -I../../../src -I../../../inc -I../../ --amdgpu-target=gfx900 rcclKernelCopy.cpp -o ./bin/rcclKernelCopy
rcclKernelCopy.cpp:8:10: fatal error: 'rcclKernels.h' file not found
#include "rcclKernels.h"
         ^~~~~~~~~~~~~~~
1 error generated.
rcclKernelCopy.cpp:8:10: fatal error: 'rcclKernels.h' file not found
#include "rcclKernels.h"
         ^~~~~~~~~~~~~~~
1 error generated.
Died at /opt/rocm/bin/hipcc line 565.
Makefile:9: recipe for target 'rcclKernelCopy' failed
make: *** [rcclKernelCopy] Error 2

iostream overload for __fp16 missing when building tests

Many errors are generated when building "validation" or "performance". After resetting the commit to "Merge pull request #29 from ROCmSoftwarePlatform/destroy-hang-fix", the build was fine again.

Example:

ccl/tests/validation$ make
mkdir -p bin
/opt/rocm/hip/bin/hipcc -I/opt/rocm/rccl/include -I../ --amdgpu-target=gfx900 --amdgpu-target=gfx803 --amdgpu-target=gfx906 rcclCommBcast.cpp -L/opt/rocm/rccl/lib -lrccl -o ./bin/commBcast
mkdir -p bin
/opt/rocm/hip/bin/hipcc -I/opt/rocm/rccl/include -I../ --amdgpu-target=gfx900 --amdgpu-target=gfx803 --amdgpu-target=gfx906 rcclBcast.cpp -L/opt/rocm/rccl/lib -lrccl -o ./bin/bcast
In file included from rcclBcast.cpp:14:
../validation/validate.h:50:17: error: use of overloaded operator '<<' is ambiguous (with operand types 'basic_ostream<char, std::char_traits >' and '__fp16')
CHECKVAL(ptr[i], val, i);
^~~~~~~~~~~~~~~~~~~~~~~~

"Cannot find Symbol" for a program linked with -fgpu-rdc

The original program uses MPI with one GPU per process and some NCCL operations.
A simple test with one process, one thread, and multiple GPUs: rccl_rdc.zip (zipped because GitHub doesn't allow uploading .cpp files)

It works without RDC:

hipcc rccl_rdc.cpp -o rccl_rdc -l rccl
AMD_LOG_LEVEL=4 HSA_FORCE_FINE_GRAIN_PCIE=1 ./rccl_rdc

But it fails with RDC:

hipcc -fgpu-rdc rccl_rdc.cpp -o rccl_rdc -l rccl
AMD_LOG_LEVEL=4 HSA_FORCE_FINE_GRAIN_PCIE=1 ./rccl_rdc
:3:hip_module.cpp           :590 : 2065834582307 us: 1665178: [7fad85723080] hipExtLaunchMultiKernelMultiDevice ( 0x7fa76806a070, 2, 3 )
:3:devprogram.cpp           :2463: 2065834582494 us: Using Code Object V4.
:1:hip_global.cpp           :68  : 2065834582995 us: Cannot find Symbol with name: _Z42ncclKernel_SendRecv_RING_SIMPLE_Sum_int8_t12ncclWorkElem 

"Cannot find Symbol"
Aborted (core dumped)

Full log: rccl_rdc.log

As you can see, I don't even need to use separate compilation of multiple files and then link them into a single binary (as is done for the original program).

Change number of workitems launched in kernels

Hi,
This is an obvious optimization. Currently, copy or reduction kernels operating on remote buffers are launched with the number of workitems equal to the number of elements in the buffer. We found that using a single workgroup with 1024, 512, or 256 workitems achieves the same bandwidth.
So, change the current implementation of the kernels to use a single workgroup for reduction ops and data transfers.
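
A minimal sketch of the idea (a hypothetical copy kernel, not RCCL's actual implementation): a single workgroup whose workitems stride over the whole buffer instead of launching one workitem per element.

#include <hip/hip_runtime.h>

// Hypothetical single-workgroup copy kernel: BLOCK_SIZE workitems stride
// over the whole buffer instead of one workitem per element.
constexpr unsigned int BLOCK_SIZE = 256;

__global__ void copySingleWorkgroup(float* __restrict__ dst,
                                    const float* __restrict__ src,
                                    size_t count) {
    for (size_t i = threadIdx.x; i < count; i += blockDim.x) {
        dst[i] = src[i];
    }
}

// Launched with a single workgroup, e.g.:
//   hipLaunchKernelGGL(copySingleWorkgroup, dim3(1), dim3(BLOCK_SIZE), 0, stream, dst, src, count);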

Bug: I can't find why this bug appears, I hope to ask for help

When I use "make -j1", these errors appear:
[ 1%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/all_reduce.cpp.o
LLVM ERROR: out of memory
Allocation failed
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0. Program arguments: /opt/rocm/llvm/bin/clang-12 -cc1 -mllvm --amdhsa-code-object-version=4 -triple amdgcn-amd-amdhsa -aux-triple mips64el-unknown-linux-gnu -emit-llvm-bc -emit-llvm-uselists -disable-free -main-file-name all_reduce.cpp -mrelocation-model pic -pic-level 2 -mframe-pointer=none -fdenormal-fp-math-f32=preserve-sign,preserve-sign -fno-rounding-math -mconstructor-aliases -aux-target-cpu mips64r2 -aux-target-feature -noabicalls -fcuda-is-device -fgpu-rdc -fcuda-allow-variadic-functions -fvisibility hidden -fapply-global-visibility-to-externs -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/hip.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/ocml.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/ockl.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/oclc_daz_opt_on.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/oclc_unsafe_math_off.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/oclc_finite_only_off.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/oclc_wavefrontsize64_on.bc -mlink-builtin-bitcode /opt/rocm/amdgcn/bitcode/oclc_isa_version_803.bc -target-cpu gfx803 -fno-split-dwarf-inlining -debugger-tuning=gdb -resource-dir /opt/rocm/llvm/lib/clang/12.0.0 -internal-isystem /opt/rocm/llvm/lib/clang/12.0.0/include/cuda_wrappers -internal-isystem /opt/rocm/include -include __clang_hip_runtime_wrapper.h -isystem /opt/rocm/llvm/lib/clang/12.0.0/include/.. -isystem /opt/rocm/hsa/include -isystem /opt/rocm/include -D HIP_ROCclr -D HIP_ROCclr -D ENABLE_COLLTRACE -D HIP_PLATFORM_HCC=1 -D HIP_ROCclr=1 -D rccl_EXPORTS -I /home/loongson/ROCm/rccl/build -I /home/loongson/ROCm/rccl/src -I /home/loongson/ROCm/rccl/src/include -I /home/loongson/ROCm/rccl/src/collectives -I /home/loongson/ROCm/rccl/src/collectives/device -c-isystem /usr/local/openblas/include -c-isystem /usr/local/deployment_tools/inference_engine/include -c-isystem /usr/local/openblas/include -c-isystem /usr/local/deployment_tools/inference_engine/include -c-isystem . -cxx-isystem /opt/rocm/rocclr/include/compiler/lib/include -cxx-isystem /opt/rocm/rocclr/include/elf -cxx-isystem /opt/rh/devtoolset-7/root/usr/include/c++/7/mips64el-redhat-linux -cxx-isystem /opt/rh/devtoolset-7/root/usr/include/c++/7 -cxx-isystem /usr/local/openblas/include -cxx-isystem /usr/local/deployment_tools/inference_engine/include -cxx-isystem /opt/rocm/rocclr/include/compiler/lib/include -cxx-isystem /opt/rocm/rocclr/include/elf -cxx-isystem /opt/rh/devtoolset-7/root/usr/include/c++/7/mips64el-redhat-linux -cxx-isystem /opt/rh/devtoolset-7/root/usr/include/c++/7 -cxx-isystem /usr/local/openblas/include -cxx-isystem /usr/local/deployment_tools/inference_engine/include -cxx-isystem . 
-internal-isystem /usr/local/include -internal-isystem /opt/rocm/llvm/lib/clang/12.0.0/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -internal-isystem /usr/local/include -internal-isystem /opt/rocm/llvm/lib/clang/12.0.0/include -internal-externc-isystem /include -internal-externc-isystem /usr/include -O3 -Wno-format-nonliteral -std=c++14 -fdeprecated-macro -fno-autolink -fdebug-compilation-dir /home/loongson/ROCm/rccl/build -ferror-limit 19 -fhip-new-launch-api -fgnuc-version=4.2.1 -fcxx-exceptions -fexceptions -fcolor-diagnostics -vectorize-loops -vectorize-slp -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false -fcuda-allow-variadic-functions -faddrsig -o /tmp/all_reduce-gfx803-a94c35.bc -x hip /home/loongson/ROCm/rccl/build/src/collectives/device/all_reduce.cpp

  1. parser at end of file
  2. Per-module optimization passes
  3. Running pass 'CallGraph Pass Manager' on module '/home/loongson/ROCm/rccl/build/src/collectives/device/all_reduce.cpp'.
    Stack dump without symbol names (ensure you have llvm-symbolizer in your PATH or set the environment var LLVM_SYMBOLIZER_PATH to point to it):
    /opt/rocm/llvm/bin/clang-12(_ZN4llvm3sys15PrintStackTraceERNS_11raw_ostreamEi+0x40)[0x121e1cd00]
    clang-12: error: unable to execute command: Aborted
    clang-12: error: clang frontend command failed due to signal (use -v to see invocation)
    clang version 12.0.0 (/home/loongson/ROCm/llvm-project/clang 91c055b1c52a830317cae78986d0d74e433864c6)
    Target: mips64el-unknown-linux-gnu
    Thread model: posix
    InstalledDir: /opt/rocm/llvm/bin
    clang-12: note: diagnostic msg: Error generating preprocessed source(s).
    CMakeFiles/rccl.dir/build.make:81: recipe for target 'CMakeFiles/rccl.dir/src/collectives/device/all_reduce.cpp.o' failed
    make[2]: *** [CMakeFiles/rccl.dir/src/collectives/device/all_reduce.cpp.o] Error 254
    CMakeFiles/Makefile2:94: recipe for target 'CMakeFiles/rccl.dir/all' failed
    make[1]: *** [CMakeFiles/rccl.dir/all] Error 2
    Makefile:170: recipe for target 'all' failed
    make: *** [all] Error 2

    When I use "make -j4", this error message appears:
    llvm-objcopy: error: invalid symbol index: 119014656
    /opt/rocm/llvm/bin/clang-offload-bundler: error: 'llvm-objcopy' tool failed
    clang-12: error: clang-offload-bundler command failed with exit code 1 (use -v to see invocation)
    CMakeFiles/rccl.dir/build.make:94: recipe for target 'CMakeFiles/rccl.dir/src/collectives/device/all_gather.cpp.o' failed
    make[2]: *** [CMakeFiles/rccl.dir/src/collectives/device/all_gather.cpp.o] Error 1
    make[2]: *** Waiting for unfinished jobs....
    llvm-objcopy: error: invalid symbol index: 119014656
    /opt/rocm/llvm/bin/clang-offload-bundler: error: 'llvm-objcopy' tool failed
    clang-12: error: clang-offload-bundler command failed with exit code 1 (use -v to see invocation)
    CMakeFiles/rccl.dir/build.make:120: recipe for target 'CMakeFiles/rccl.dir/src/collectives/device/broadcast.cpp.o' failed
    make[2]: *** [CMakeFiles/rccl.dir/src/collectives/device/broadcast.cpp.o] Error 1

    And then it happens again. I can't find the reason for these errors or where they come from.

Does RCCL support the Infinity Fabric?

Hello!
I have MI100 GPUs interconnected via Infinity Fabric and I'd like to use it with RCCL. The TransferBench tool gives me only 27.5 GB/s. Does RCCL have any options to turn on Infinity Fabric?
Thanks!

Cmake command is failing to find ROCM

After installing the latest ROCm in a clean Ubuntu 16.04 Docker container, I tried to follow the manual build instructions, but the cmake command fails to find ROCM. Do you know if I missed something?

root@91153f223841:~/rccl/build# CXX=/opt/rocm/bin/hcc cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..
-- The CXX compiler identification is Clang 9.0.0
-- Check for working CXX compiler: /opt/rocm/bin/hcc
-- Check for working CXX compiler: /opt/rocm/bin/hcc -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
CMake Error at CMakeLists.txt:9 (find_package):
  Could not find a package configuration file provided by "ROCM" with any of
  the following names:

    ROCMConfig.cmake
    rocm-config.cmake

  Add the installation prefix of "ROCM" to CMAKE_PREFIX_PATH or set
  "ROCM_DIR" to a directory containing one of the above files.  If "ROCM"
  provides a separate development package or SDK, be sure it has been
  installed.


-- Configuring incomplete, errors occurred!
See also "/root/rccl/build/CMakeFiles/CMakeOutput.log".

Build fails following standard procedure

~/git/rccl-git/build/ make -j 8                                                                                                                                                [0]
Scanning dependencies of target rccl
[  2%] Building CXX object CMakeFiles/rccl.dir/src/collectives/broadcast.cpp.o
[  5%] Building CXX object CMakeFiles/rccl.dir/src/collectives/reduce_scatter.cpp.o
[ 11%] Building CXX object CMakeFiles/rccl.dir/src/init.cpp.o
[ 11%] Building CXX object CMakeFiles/rccl.dir/src/collectives/all_reduce.cpp.o
[ 13%] Building CXX object CMakeFiles/rccl.dir/src/collectives/reduce.cpp.o
[ 16%] Building CXX object CMakeFiles/rccl.dir/src/collectives/all_gather.cpp.o
[ 19%] Building CXX object CMakeFiles/rccl.dir/src/bootstrap.cpp.o
[ 22%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/functions.cpp.o
[ 25%] Building CXX object CMakeFiles/rccl.dir/src/misc/enqueue.cpp.o
[ 27%] Building CXX object CMakeFiles/rccl.dir/src/misc/group.cpp.o
[ 30%] Building CXX object CMakeFiles/rccl.dir/src/misc/ibvwrap.cpp.o
[ 33%] Building CXX object CMakeFiles/rccl.dir/src/misc/nvmlwrap_stub.cpp.o
[ 36%] Building CXX object CMakeFiles/rccl.dir/src/misc/rings.cpp.o
[ 38%] Building CXX object CMakeFiles/rccl.dir/src/misc/utils.cpp.o
[ 41%] Building CXX object CMakeFiles/rccl.dir/src/ring.cpp.o
[ 44%] Building CXX object CMakeFiles/rccl.dir/src/transport.cpp.o
[ 47%] Building CXX object CMakeFiles/rccl.dir/src/transport/net.cpp.o
[ 50%] Building CXX object CMakeFiles/rccl.dir/src/transport/net_ib.cpp.o
[ 52%] Building CXX object CMakeFiles/rccl.dir/src/transport/net_socket.cpp.o
[ 55%] Building CXX object CMakeFiles/rccl.dir/src/transport/p2p.cpp.o
[ 58%] Building CXX object CMakeFiles/rccl.dir/src/transport/shm.cpp.o
[ 61%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/all_gather_0.cpp.o
[ 63%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/all_reduce_0.cpp.o
[ 66%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/all_reduce_1.cpp.o
[ 69%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/all_reduce_2.cpp.o
[ 72%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/all_reduce_3.cpp.o
[ 75%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/broadcast_0.cpp.o
[ 77%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_0.cpp.o
[ 80%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_1.cpp.o
[ 83%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_2.cpp.o
[ 86%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_3.cpp.o
[ 88%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_scatter_0.cpp.o
[ 91%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_scatter_1.cpp.o
[ 94%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_scatter_2.cpp.o
[ 97%] Building CXX object CMakeFiles/rccl.dir/src/collectives/device/reduce_scatter_3.cpp.o
[100%] Linking CXX shared library librccl.so
Error opening '/opt/rocm/bin/../lib/LLVMSelectAcceleratorCode.so': /opt/rocm/bin/../lib/LLVMSelectAcceleratorCode.so: cannot open shared object file: No such file or directory
  -load request ignored.
Error opening '/opt/rocm/bin/../lib/LLVMPromotePointerKernArgsToGlobal.so': /opt/rocm/bin/../lib/LLVMPromotePointerKernArgsToGlobal.so: cannot open shared object file: No such file or directory
  -load request ignored.
Error opening '/opt/rocm/bin/../lib/LLVMSelectAcceleratorCode.so': /opt/rocm/bin/../lib/LLVMSelectAcceleratorCode.so: cannot open shared object file: No such file or directory
  -load request ignored.
Error opening '/opt/rocm/bin/../lib/LLVMPromotePointerKernArgsToGlobal.so': /opt/rocm/bin/../lib/LLVMPromotePointerKernArgsToGlobal.so: cannot open shared object file: No such file or directory
  -load request ignored.
Error opening '/opt/rocm/bin/../lib/LLVMSelectAcceleratorCode.so': /opt/rocm/bin/../lib/LLVMSelectAcceleratorCode.so: cannot open shared object file: No such file or directory
  -load request ignored.
Error opening '/opt/rocm/bin/../lib/LLVMPromotePointerKernArgsToGlobal.so': /opt/rocm/bin/../lib/LLVMPromotePointerKernArgsToGlobal.so: cannot open shared object file: No such file or directory
  -load request ignored.
opt: Unknown command line argument '-select-accelerator-code'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --split-spill-mode'?
opt: Unknown command line argument '-select-accelerator-code'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --split-spill-mode'?
opt: Unknown command line argument '-select-accelerator-code'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --split-spill-mode'?
opt: Unknown command line argument '-sac-enable-function-calls=1'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --amdgpu-function-calls=1'?
opt: Unknown command line argument '-sac-enable-function-calls=1'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --amdgpu-function-calls=1'?
opt: Unknown command line argument '-sac-enable-function-calls=1'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --amdgpu-function-calls=1'?
opt: Unknown command line argument '-promote-pointer-kernargs-to-global'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --remat-pic-stub-load'?
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx803
opt: Unknown command line argument '-promote-pointer-kernargs-to-global'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --remat-pic-stub-load'?
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx900
opt: Unknown command line argument '-promote-pointer-kernargs-to-global'.  Try: '/opt/rocm/bin/opt --help'
opt: Did you mean '  --remat-pic-stub-load'?
Generating AMD GCN kernel failed in HCC-specific opt passes for target: gfx906
clang-9: error: linker command failed with exit code 7 (use -v to see invocation)
make[2]: *** [CMakeFiles/rccl.dir/build.make:598: librccl.so] Error 7
make[1]: *** [CMakeFiles/Makefile2:73: CMakeFiles/rccl.dir/all] Error 2
make: *** [Makefile:152: all] Error 2

I couldn't find any information about these LLVM files, and they don't seem to be present in the ROCm/tensorflow:latest Docker image, so I really don't understand what is going on here.
(Info: I'm using Arch Linux, so I built most ROCm packages myself.)

Stale versions in README

This (and some related text) is very dated:

These binaries are currently packaged with roc-master, and will be included in ROCm 2.4.

Also, it would be helpful to explain up front what the value proposition of calling RCCL directly is, compared to device-aware MPI. The NCCL slides show embarrassingly poor performance (evidently without irony), and it's hard to tell whether RCCL is better, let alone competitive with MPI.

Is a paired ncclSend/ncclRecv call needed?

Hello!
I have a couple of questions.

Q1
I use a P2P collective. GPUs are connected via Infinity Fabric.
I want to send data from GPU 0 to GPU 1. I have a misunderstanding here:
A) Is a single call to ncclSend(...peer=1...) on the GPU 0 side (or ncclRecv(...peer=0...) on the GPU 1 side) enough?
B) Or should I issue a pair of matched calls?:
* on GPU 0: ncclSend( ...peer=1...)
* on GPU 1: ncclRecv( ...peer=0...)

It looks like TransferBench uses a single launch of GpuCopyKernel (https://github.com/ROCmSoftwarePlatform/rccl/blob/685bcea1275c5fd400b1784c393791a2ee11c947/tools/TransferBench/TransferBench.cpp#L1021) to copy from GPU 0 to GPU 1.
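
For what it's worth, here is a minimal sketch of option B for the case where a single thread drives both GPUs (comm0/comm1, stream0/stream1 and the buffers are hypothetical names set up elsewhere; the header location can differ between ROCm versions):

#include <hip/hip_runtime.h>
#include <rccl.h>  // include path varies with the ROCm version

// Sketch only: an ncclSend on one rank is matched by an ncclRecv on the peer
// rank; when one thread drives both GPUs, the matched calls are placed in a
// group so both launches are issued together and cannot block each other.
void sendGpu0ToGpu1(ncclComm_t comm0, ncclComm_t comm1,
                    hipStream_t stream0, hipStream_t stream1,
                    const float* sendbuf, float* recvbuf, size_t count) {
    ncclGroupStart();
    ncclSend(sendbuf, count, ncclFloat, /*peer=*/1, comm0, stream0);  // GPU 0 side
    ncclRecv(recvbuf, count, ncclFloat, /*peer=*/0, comm1, stream1);  // GPU 1 side
    ncclGroupEnd();
}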

Q2
Will anything change if I switch from Infinity Fabric to PCIe?

Thanks in advance for any information!

AllGather / AllReduce hang

RCCL VERSION 2708

I use 4 cards managed by 4 processes to train the seq2seq network, and it hangs in ncclAllGather / ncclAllReduce

The call stack for processes 0, 1, and 3 is:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::imperative::BasicEngine::Execute()
1   paddle::imperative::GradientAccumulator::CallReduceHooks()
2   paddle::imperative::Reducer::AddDistHook(unsigned long)
3   paddle::imperative::Reducer::MarkVarReady(unsigned long, bool)
4   paddle::imperative::Reducer::MarkGroupReady(unsigned long)
5   paddle::imperative::Reducer::FusedAllReduceSchedule(int, paddle::imperative::Group&, int)
6   paddle::imperative::NCCLParallelContext::AllReduceByStream(paddle::framework::Variable const&, paddle::framework::Variable*, int, bool)
7   paddle::imperative::AllReduce(paddle::framework::Variable const&, paddle::framework::Variable*, paddle::imperative::ParallelStrategy const&, int, bool)
8   ncclAllReduce
9   ncclEnqueueCheck(ncclInfo*)
10  ncclBarrierEnqueueWait(ncclComm*)
11  __device_stub__ncclSendRecvKernel_copy_i8(ncclDevComm*)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1632816581 (unix time) try "date -d @1632816581" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0xaf99) received by PID 45015 (TID 0x7f5e0faa4740) from PID 44953 ***]

The logs for processes 0, 1, and 3 are:

h03r3n06:45006:45006 [0] NCCL INFO AllReduce: opCount 28c sendbuff 0x7f5dd32b2100 recvbuff 0x7f5dd32b2100 count 9984512 datatype 7 op 0 root 0 comm 0x55bef0201370 [nranks=4] stream 0x55bef0208bd0
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.023938] [00:00] 00028c KL hwid 42802200 funcIndex 1508
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.023939] [00:01] 00028c KL hwid 42806120 funcIndex 1508
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.143739] [00:00] 00028c KE
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.143750] [00:01] 00028c KE
h03r3n06:45006:45006 [0] NCCL INFO AllReduce: opCount 28d sendbuff 0x7f5dd139b900 recvbuff 0x7f5dd139b900 count 8149504 datatype 7 op 0 root 0 comm 0x55bef0201370 [nranks=4] stream 0x55bef0208bd0
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.309984] [00:00] 00028d KL hwid 42802500 funcIndex 1508
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.309985] [00:01] 00028d KL hwid 42806620 funcIndex 1508
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.333012] [00:01] 00028d KE
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.333015] [00:00] 00028d KE
h03r3n06:45006:45006 [0] NCCL INFO AllReduce: opCount 28e sendbuff 0x7f5dcf208100 recvbuff 0x7f5dcf208100 count 8801792 datatype 7 op 0 root 0 comm 0x55bef0201370 [nranks=4] stream 0x55bef0208bd0
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.335126] [00:01] 00028e KL hwid 42802200 funcIndex 1508
h03r3n06:45006:45113 [0] NCCL INFO ## [10042496.335126] [00:00] 00028e KL hwid 42800200 funcIndex 1508
h03r3n06:45009:45009 [1] NCCL INFO AllReduce: opCount 28c sendbuff 0x7fbe592b2100 recvbuff 0x7fbe592b2100 count 9984512 datatype 7 op 0 root 0 comm 0x55bcee8de020 [nranks=4] stream 0x55bcee8e62d0
h03r3n06:45009:45116 [0] NCCL INFO ## [10042497.726389] [01:00] 00028c KL hwid 42802320 funcIndex 1508
h03r3n06:45009:45116 [0] NCCL INFO ## [10042497.726389] [01:01] 00028c KL hwid 42806620 funcIndex 1508
h03r3n06:45009:45116 [0] NCCL INFO ## [10042497.831862] [01:00] 00028c KE
h03r3n06:45009:45116 [0] NCCL INFO ## [10042497.831873] [01:01] 00028c KE
h03r3n06:45009:45009 [1] NCCL INFO AllReduce: opCount 28d sendbuff 0x7fbe5739b900 recvbuff 0x7fbe5739b900 count 8149504 datatype 7 op 0 root 0 comm 0x55bcee8de020 [nranks=4] stream 0x55bcee8e62d0
h03r3n06:45009:45116 [0] NCCL INFO ## [10042497.983146] [01:00] 00028d KL hwid 42802930 funcIndex 1508
h03r3n06:45009:45116 [0] NCCL INFO ## [10042497.983147] [01:01] 00028d KL hwid 42806f20 funcIndex 1508
h03r3n06:45009:45116 [0] NCCL INFO ## [10042498.021146] [01:01] 00028d KE
h03r3n06:45009:45116 [0] NCCL INFO ## [10042498.021147] [01:00] 00028d KE
h03r3n06:45009:45009 [1] NCCL INFO AllReduce: opCount 28e sendbuff 0x7fbe55208100 recvbuff 0x7fbe55208100 count 8801792 datatype 7 op 0 root 0 comm 0x55bcee8de020 [nranks=4] stream 0x55bcee8e62d0
h03r3n06:45009:45116 [0] NCCL INFO ## [10042498.023144] [01:00] 00028e KL hwid 42800230 funcIndex 1508
h03r3n06:45009:45116 [0] NCCL INFO ## [10042498.023145] [01:01] 00028e KL hwid 42802010 funcIndex 1508

h03r3n06:45015:45015 [3] NCCL INFO AllReduce: opCount 28c sendbuff 0x7f597d2b2100 recvbuff 0x7f597d2b2100 count 9984512 datatype 7 op 0 root 0 comm 0x5611f23d07c0 [nranks=4] stream 0x5611f23da810
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.466647] [03:01] 00028c KL hwid 42802110 funcIndex 1508
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.466648] [03:00] 00028c KL hwid 42800d00 funcIndex 1508
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.472905] [03:00] 00028c KE
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.472912] [03:01] 00028c KE
h03r3n06:45015:45015 [3] NCCL INFO AllReduce: opCount 28d sendbuff 0x7f597b39b900 recvbuff 0x7f597b39b900 count 8149504 datatype 7 op 0 root 0 comm 0x5611f23d07c0 [nranks=4] stream 0x5611f23da810
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.657783] [03:01] 00028d KL hwid 42802b10 funcIndex 1508
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.657784] [03:00] 00028d KL hwid 42800d30 funcIndex 1508
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.662014] [03:00] 00028d KE
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.662019] [03:01] 00028d KE
h03r3n06:45015:45015 [3] NCCL INFO AllReduce: opCount 28e sendbuff 0x7f5979208100 recvbuff 0x7f5979208100 count 8801792 datatype 7 op 0 root 0 comm 0x5611f23d07c0 [nranks=4] stream 0x5611f23da810
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.664283] [03:01] 00028e KL hwid 42800b10 funcIndex 1508
h03r3n06:45015:45115 [0] NCCL INFO ## [10042496.664283] [03:00] 00028e KL hwid 42804c30 funcIndex 1508

The call stack for process 2 is:

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::imperative::Tracer::TraceOp(std::string const&, paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap, paddle::platform::Place const&, bool, std::map<std::string, std::string, std::less<std::string>, std::allocator<std::pair<std::string const, std::string> > > const&)
1   paddle::imperative::PreparedOp::Run(paddle::imperative::NameVarBaseMap const&, paddle::imperative::NameVarBaseMap const&, paddle::framework::AttributeMap const&, paddle::framework::AttributeMap const&)
2   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CUDAPlace, false, 3ul, paddle::operators::CAllGatherOpCUDAKernel<float>, paddle::operators::CAllGatherOpCUDAKernel<double>, paddle::operators::CAllGatherOpCUDAKernel<int>, paddle::operators::CAllGatherOpCUDAKernel<long>, paddle::operators::CAllGatherOpCUDAKernel<paddle::platform::float16> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
3   paddle::operators::CAllGatherOpCUDAKernel<long>::Compute(paddle::framework::ExecutionContext const&) const
4   ncclAllGather
5   ncclEnqueueCheck(ncclInfo*)
6   ncclBarrierEnqueueWait(ncclComm*)
7   __device_stub__ncclSendRecvKernel_copy_i8(ncclDevComm*)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1632816581 (unix time) try "date -d @1632816581" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0xaf99) received by PID 45009 (TID 0x7fc2e4bf8740) from PID 44953 ***]

The log for process 2 is:

h03r3n06:45012:45012 [2] NCCL INFO AllGather: opCount 28c sendbuff 0x7f1fd894af00 recvbuff 0x7f1ed1c00100 count 59205120 datatype 0 op 0 root 0 comm 0x559c24b55610 [nranks=4] stream 0x559c248a3630
h03r3n06:45012:45114 [0] NCCL INFO ## [10042493.600355] [02:01] 00028c KL hwid 42800fa0 funcIndex 725
h03r3n06:45012:45114 [0] NCCL INFO ## [10042493.600356] [02:00] 00028c KL hwid 42804e90 funcIndex 725
h03r3n06:45012:45012 [2] NCCL INFO AllGather: opCount 28d sendbuff 0x7f1fce20e500 recvbuff 0x7f2032532100 count 15360 datatype 0 op 0 root 0 comm 0x559c24b55610 [nranks=4] stream 0x559c248a3630
h03r3n06:45012:45114 [0] NCCL INFO ## [10042494.517377] [02:01] 00028c KE
h03r3n06:45012:45114 [0] NCCL INFO ## [10042494.517377] [02:00] 00028c KE
h03r3n06:45012:45114 [0] NCCL INFO ## [10042494.517839] [02:01] 00028d KL hwid 42806b80 funcIndex 723
h03r3n06:45012:45114 [0] NCCL INFO ## [10042494.517840] [02:00] 00028d KL hwid 42802fb0 funcIndex 723

What common causes or ideas could help me troubleshoot this problem?

Any comments would be much appreciated, thanks!

Performance question about `HSA_FORCE_FINE_GRAIN_PCIE=1`

Does HSA_FORCE_FINE_GRAIN_PCIE=1 improve the performance?
When running HSA_FORCE_FINE_GRAIN_PCIE=1 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4, I saw that the bandwidth is much higher than without HSA_FORCE_FINE_GRAIN_PCIE=1, and the time (in us) is also shorter.

However, when enabling HSA_FORCE_FINE_GRAIN_PCIE=1 for tensorflow-rocm and running the allreduce benchmark with HSA_FORCE_FINE_GRAIN_PCIE=1 python3 tf_cnn_benchmarks.py --num_gpus=4 --model resnet50 --batch_size=128 --variable_update=replicated --all_reduce_spec=nccl, the performance is much worse.

Memory access fault by GPU node-2 ROCM 4.3 dual 6800XT

$ uname -a
Linux ian-TRX40-AORUS-PRO-WIFI 5.8.0-63-generic #71~20.04.1-Ubuntu SMP Thu Jul 15 17:46:08 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

$ /opt/rocm/opencl/bin/clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (3305.0)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback

Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Device 73bf
Device Topology: PCI[ B#35, D#0, F#0 ]
Max compute units: 36
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2575Mhz
Address bits: 64
Max memory allocation: 14588628168
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 29631
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 17163091968
Constant buffer size: 14588628168
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 1703726280
Max global variable size: 14588628168
Max global variable preferred total size: 17163091968
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 32
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: Yes
Profiling : Yes
Platform ID: 0x7f5a474e4e10
Name: gfx1030
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 2.0
Driver version: 3305.0 (HSA1.1,LC)
Profile: FULL_PROFILE
Version: OpenCL 2.0
Extensions: cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program

Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: Device 73bf
Device Topology: PCI[ B#75, D#0, F#0 ]
Max compute units: 36
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 2575Mhz
Address bits: 64
Max memory allocation: 14588628168
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 16384
Max image 3D height: 16384
Max image 3D depth: 8192
Max samplers within kernel: 29631
Max size of kernel argument: 1024
Alignment (bits) of base address: 1024
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: Yes
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 17163091968
Constant buffer size: 14588628168
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 65536
Max pipe arguments: 16
Max pipe active reservations: 16
Max pipe packet size: 1703726280
Max global variable size: 14588628168
Max global variable preferred total size: 17163091968
Max read/write image args: 64
Max on device events: 1024
Queue on device max size: 8388608
Max on device queues: 1
Queue on device preferred size: 262144
SVM capabilities:
Coarse grain buffer: Yes
Fine grain buffer: Yes
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Memory access fault by GPU node-2 (Agent handle: 0x55911510be90) on address (nil). Reason: Page not present or supervisor privilege.
Aborted (core dumped)

Use device index from comm id

Hi,
Currently, when debug mode is enabled, RCCL uses the HIP runtime API to query the current GPU device. Instead of that, use the device index embedded inside the RCCL communicator.
Change from

    if ((RCCL_TRACE_RT & krccl_print_api) == krccl_print_api) {
        int dev;
        hipGetDevice(&dev);
        fprintf(stderr,
                "%s<<rccl-api:%s rccl-device:%d sendbuff:%p recvbuff:%p "
                "count:%d datatype:%s op:%s comm:%p stream:%p%s\n",
                API_COLOR, __func__, dev, sendbuff, recvbuff, count,
                umap_datatype[datatype].c_str(), umap_red_op[op].c_str(), comm,
                stream, API_COLOR_END);
    }

to

    if ((RCCL_TRACE_RT & krccl_print_api) == krccl_print_api) {
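        // Use the device index stored in the communicator (comm->device_) instead of calling hipGetDevice().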
        fprintf(stderr,
                "%s<<rccl-api:%s rccl-device:%d sendbuff:%p recvbuff:%p "
                "count:%d datatype:%s op:%s comm:%p stream:%p%s\n",
                API_COLOR, __func__, comm->device_, sendbuff, recvbuff, count,
                umap_datatype[datatype].c_str(), umap_red_op[op].c_str(), comm,
                stream, API_COLOR_END);
    }

Param does not work for me

I tried --display-topo, but it does not seem to work for me this time.
The command I use: /mpirun -n 2 -mca pml ucx -x UCX_NET_DEVICES=eno1,eno2 --display-topo -display-map -host mgpu1:1,mgpu2:1 --display-allocation -output-filename log.log ./all_reduce_perf -g 2 -b 100 -e 4000000

Does anybody know about this?

What is Chordal Ring in RCCL?

I found that RCCL supports many Rome models, and there is a best-performing ring called ChordalRing. What is this, and why is this ring's performance the best of all?

Distributed data parallel training stalls with ROCm 5.0.2

Distributed data parallel training hangs using pytorch with the ROCm 5.0.2 release when training on more than one node.

I am testing on OLCF crusher with eight MI250x GPUs in one node https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html

However, the behavior can be reproduced using only 1 GPU per node on two nodes.

Minimal reproducer script (harness.py)

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.optim as optim

if __name__ == '__main__':
    local_rank = 0
    if 'LOCAL_RANK' in os.environ:
        local_rank = int(os.environ["LOCAL_RANK"])

    torch.cuda.set_device(local_rank)
    print('Initialized GPU {}'.format(local_rank))

    dist.init_process_group(backend='nccl')
    print('Initialized process group on rank {}'.format(dist.get_rank()))

    rank = dist.get_rank()
    tensor_sizes = []
    k = 1
    for i in range(16):
        tensor_sizes.append(k)
        k*=2

    for tensor_size in tensor_sizes:
        print(tensor_size)
        model = torch.nn.Linear(tensor_size, tensor_size).to(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

        tensor = torch.tensor(range(tensor_size), dtype=torch.float32).cuda(
            local_rank)
        ddp_model(tensor).sum().backward()
        optimizer.step()

Launcher (run_harness.sh)

torchrun --nproc_per_node 1 --nnodes 2 --rdzv_backend=c10d --rdzv_endpoint=$1:29400 harness.py ${@:2}

Output (on a two node allocation):

$ ip a # show IP of preferred network interface on the master node
$ srun sh run_harness.sh <master_ip>
# I am omitting some messages related to torchrun
Initialized GPU 0
Initialized GPU 0
Initialized process group on rank 0
Initialized process group on rank 1
1
1
2
2
4
4
8
8
16
16
32
32
64
64
128
128
^C

The process stalls after a few iterations.

It is possible to get the loop to finish by setting the environment variable NCCL_PROTO=Simple.

Digging deeper, the hang was introduced between ROCm 4.5.2 and 5.0.2, in this commit:
565fbeb

Specifically, the change in BROADCAST_CHUNKSTEPS from 1 to 2 causes the problem.

Versions:

  • pytorch: 1.12.0a0+git9429dbb (master branch)
    • compilation command line:
      • CXX=g++ CC=gcc CXXFLAGS=-lncurses PYTORCH_ROCM_ARCH="gfx90a" USE_MPI=0 USE_ROCM=1 python setup.py bdist_wheel --verbose
  • ROCm: 5.0.2 (the problem still exists in RCCL branch rocm-5.1.0)

Debug output (NCCL_DEBUG=info):

crusher002:81767:81767 [0] NCCL INFO Bootstrap : Using bond0:100.65.2.2<0>
crusher002:81767:81767 [0] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
crusher002:81767:81767 [0] NCCL INFO NET/IB : No device found.
crusher002:81767:81767 [0] NCCL INFO NET/Socket : Using [0]bond0:100.65.2.2<0> [1]hsn0:10.129.0.13<0> [2]hsn1:10.129.0.14<0> [3]hsn2:10.129.0.15<0> [4]hsn3:10.129.0.16<0>
crusher002:81767:81767 [0] NCCL INFO Using network Socket
RCCL version 2.10.3+hip5.0
crusher003:24359:24359 [0] NCCL INFO Bootstrap : Using bond0:100.65.2.3<0>
crusher003:24359:24359 [0] NCCL INFO NET/Plugin : No plugin found (librccl-net.so), using internal implementation
crusher003:24359:24359 [0] NCCL INFO NET/IB : No device found.
crusher003:24359:24359 [0] NCCL INFO NET/Socket : Using [0]bond0:100.65.2.3<0> [1]hsn0:10.129.0.17<0> [2]hsn1:10.129.0.18<0> [3]hsn2:10.129.0.19<0> [4]hsn3:10.129.0.20<0>
crusher003:24359:24359 [0] NCCL INFO Using network Socket
crusher002:81767:81779 [0] NCCL INFO Not performing bootstrap root for clique kernels as clique mode not enabled.
crusher002:81767:81780 [0] NCCL INFO rocm_smi_lib: version 5.0.0.0
crusher003:24359:24369 [0] NCCL INFO rocm_smi_lib: version 5.0.0.0
crusher002:81767:81780 [0] NCCL INFO Clique kernels disabled
crusher003:24359:24369 [0] NCCL INFO Clique kernels disabled
crusher003:24359:24369 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1 comm 0x7f9e80000ef0 nRanks 02 busId c1000
crusher003:24359:24369 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c1000] [receive] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c1000] [receive] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c1000] [receive] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c1000] [receive] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 00/04 :    0   1
crusher002:81767:81780 [0] NCCL INFO Channel 01/04 :    0   1
crusher002:81767:81780 [0] NCCL INFO Channel 02/04 :    0   1
crusher002:81767:81780 [0] NCCL INFO Channel 03/04 :    0   1
crusher002:81767:81780 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1 comm 0x7f6400000ef0 nRanks 02 busId c1000
crusher002:81767:81780 [0] NCCL INFO Channel 00 : 1[c1000] -> 0[c1000] [receive] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 01 : 1[c1000] -> 0[c1000] [receive] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 02 : 1[c1000] -> 0[c1000] [receive] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 03 : 1[c1000] -> 0[c1000] [receive] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 00 : 0[c1000] -> 1[c1000] [send] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 01 : 0[c1000] -> 1[c1000] [send] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 00 : 1[c1000] -> 0[c1000] [send] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 02 : 0[c1000] -> 1[c1000] [send] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 01 : 1[c1000] -> 0[c1000] [send] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Channel 03 : 0[c1000] -> 1[c1000] [send] via NET/Socket/1 comm 0x7f6400000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 02 : 1[c1000] -> 0[c1000] [send] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher003:24359:24369 [0] NCCL INFO Channel 03 : 1[c1000] -> 0[c1000] [send] via NET/Socket/1 comm 0x7f9e80000ef0 nRanks 02
crusher002:81767:81780 [0] NCCL INFO Connected all rings comm 0x7f6400000ef0 nRanks 02 busId c1000
crusher002:81767:81780 [0] NCCL INFO Connected all trees comm 0x7f6400000ef0 nRanks 02 busId c1000
crusher003:24359:24369 [0] NCCL INFO Connected all rings comm 0x7f9e80000ef0 nRanks 02 busId c1000
crusher003:24359:24369 [0] NCCL INFO Connected all trees comm 0x7f9e80000ef0 nRanks 02 busId c1000
crusher003:24359:24369 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
crusher003:24359:24369 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
crusher002:81767:81780 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
crusher002:81767:81780 [0] NCCL INFO 4 coll channels, 4 p2p channels, 1 p2p channels per peer
crusher003:24359:24369 [0] NCCL INFO comm 0x7f9e80000ef0 rank 1 nranks 2 cudaDev 0 busId c1000 used 27440 bytes - Init COMPLETE
crusher002:81767:81780 [0] NCCL INFO comm 0x7f6400000ef0 rank 0 nranks 2 cudaDev 0 busId c1000 used 27440 bytes - Init COMPLETE
crusher002:81767:81767 [0] NCCL INFO Launch mode Parallel/CGMD

Performance for multiple processes in one node with a PCIe switch

Two processes, each controlling one GPU.

  1. Two GPUs are connected via one PCIE Switch.
    build/all_reduce_perf --minbytes=16 --maxbytes=262144000 --parallel_init=0 --warmup_iters=10 --stepfactor=2 --iters=1000

    size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
     (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       16             4   float     sum    25.81    0.00    0.00  0e+00    26.80    0.00    0.00  0e+00
       32             8   float     sum    26.77    0.00    0.00  0e+00    26.74    0.00    0.00  0e+00
       64            16   float     sum    27.03    0.00    0.00  0e+00    27.05    0.00    0.00  0e+00
      128            32   float     sum    27.05    0.00    0.00  0e+00    27.02    0.00    0.00  0e+00
      256            64   float     sum    27.12    0.01    0.01  0e+00    27.16    0.01    0.01  0e+00
      512           128   float     sum    27.39    0.02    0.02  0e+00    27.36    0.02    0.02  0e+00
     1024           256   float     sum    27.74    0.04    0.04  0e+00    27.68    0.04    0.04  0e+00
     2048           512   float     sum    28.33    0.07    0.07  0e+00    28.30    0.07    0.07  0e+00
     4096          1024   float     sum    43.39    0.09    0.09  0e+00    43.29    0.09    0.09  0e+00
     8192          2048   float     sum    30.88    0.27    0.27  0e+00    30.85    0.27    0.27  0e+00
    16384          4096   float     sum    32.19    0.51    0.51  0e+00    32.14    0.51    0.51  0e+00
    32768          8192   float     sum    42.48    0.77    0.77  0e+00    42.40    0.77    0.77  0e+00
    65536         16384   float     sum    62.45    1.05    1.05  0e+00    62.29    1.05    1.05  0e+00
   131072         32768   float     sum    101.5    1.29    1.29  0e+00    101.4    1.29    1.29  0e+00
   262144         65536   float     sum    103.6    2.53    2.53  0e+00    103.3    2.54    2.54  0e+00
   524288        131072   float     sum    171.3    3.06    3.06  0e+00    171.4    3.06    3.06  0e+00
  1048576        262144   float     sum    313.0    3.35    3.35  0e+00    313.1    3.35    3.35  0e+00
  2097152        524288   float     sum    593.6    3.53    3.53  0e+00    593.4    3.53    3.53  0e+00
  4194304       1048576   float     sum   1154.7    3.63    3.63  0e+00   1153.1    3.64    3.64  0e+00
  8388608       2097152   float     sum   2264.0    3.71    3.71  0e+00   2266.8    3.70    3.70  0e+00
 16777216       4194304   float     sum   4463.2    3.76    3.76  0e+00   4463.6    3.76    3.76  0e+00
 33554432       8388608   float     sum   8844.7    3.79    3.79  0e+00   8840.8    3.80    3.80  0e+00
    
  2. Two GPUs are connected via CPU without PCIE Switch.

    size         count    type   redop     time   algbw   busbw  error     time   algbw   busbw  error
     (B)    (elements)                     (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
       16             4   float     sum    43.43    0.00    0.00  0e+00    43.32    0.00    0.00  0e+00
       32             8   float     sum    43.26    0.00    0.00  0e+00    43.25    0.00    0.00  0e+00
       64            16   float     sum    43.25    0.00    0.00  0e+00    43.32    0.00    0.00  0e+00
      128            32   float     sum    43.29    0.00    0.00  0e+00    43.28    0.00    0.00  0e+00
      256            64   float     sum    43.48    0.01    0.01  0e+00    43.39    0.01    0.01  0e+00
      512           128   float     sum    43.72    0.01    0.01  0e+00    43.75    0.01    0.01  0e+00
     1024           256   float     sum    44.22    0.02    0.02  0e+00    44.20    0.02    0.02  0e+00
     2048           512   float     sum    45.04    0.05    0.05  0e+00    45.01    0.05    0.05  0e+00
     4096          1024   float     sum    49.23    0.08    0.08  0e+00    49.60    0.08    0.08  0e+00
     8192          2048   float     sum    49.88    0.16    0.16  0e+00    49.97    0.16    0.16  0e+00
    16384          4096   float     sum    57.50    0.28    0.28  0e+00    57.47    0.29    0.29  0e+00
    32768          8192   float     sum    72.37    0.45    0.45  0e+00    72.35    0.45    0.45  0e+00
    65536         16384   float     sum    104.0    0.63    0.63  0e+00    104.0    0.63    0.63  0e+00
   131072         32768   float     sum    170.3    0.77    0.77  0e+00    170.1    0.77    0.77  0e+00
   262144         65536   float     sum    92.33    2.84    2.84  0e+00    92.37    2.84    2.84  0e+00
   524288        131072   float     sum    132.0    3.97    3.97  0e+00    132.1    3.97    3.97  0e+00
  1048576        262144   float     sum    223.4    4.69    4.69  0e+00    223.8    4.68    4.68  0e+00
  2097152        524288   float     sum    381.5    5.50    5.50  0e+00    381.5    5.50    5.50  0e+00
  4194304       1048576   float     sum    700.5    5.99    5.99  0e+00    698.8    6.00    6.00  0e+00
  8388608       2097152   float     sum   1370.1    6.12    6.12  0e+00   1366.8    6.14    6.14  0e+00
 16777216       4194304   float     sum   2640.7    6.35    6.35  0e+00   2642.4    6.35    6.35  0e+00
 33554432       8388608   float     sum   5169.0    6.49    6.49  0e+00   5163.0    6.50    6.50  0e+00
    

I think neither 1 nor 2 achieves good performance for PCIe Gen3 x16, and performance is even lower when the GPUs are connected through one PCIe switch.
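
For context, a rough upper bound (my own back-of-the-envelope arithmetic, not from the report): PCIe Gen3 signals at 8 GT/s per lane with 128b/130b encoding, so a x16 link tops out at roughly

    16 lanes * 8 GT/s * (128/130) / 8 bits per byte ≈ 15.75 GB/s

per direction, so the ~3.8 GB/s and ~6.5 GB/s bus bandwidths measured above are indeed well below the link's theoretical peak.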

ROCMConfig.cmake

Could not find a package configuration file provided by "ROCM" with any of
the following names:

ROCMConfig.cmake
rocm-config.cmake

Add the installation prefix of "ROCM" to CMAKE_PREFIX_PATH or set
"ROCM_DIR" to a directory containing one of the above files. If "ROCM"
provides a separate development package or SDK, be sure it has been
installed.

How can I solve this problem?
