
Microsoft Collective Communication Library

License: Other

MSCCL

Microsoft Collective Communication Library (MSCCL) is a platform to execute custom collective communication algorithms for multiple accelerators supported by Microsoft Azure.

Introduction

MSCCL is an inter-accelerator communication framework that is built on top of NCCL and uses its building blocks to execute custom-written collective communication algorithms. MSCCL's vision is to provide a unified, efficient, and scalable framework for executing collective communication algorithms across multiple accelerators. To achieve this, MSCCL has multiple capabilities:

Programmability: Interconnects among accelerators have different latencies and bandwidths. Therefore, a generic collective communication algorithm does not necessarily perform well for all topologies and buffer sizes. MSCCL allows a user to write a hyper-optimized collective communication algorithm for a given topology and buffer size. This is possible through two main components: the MSCCL toolkit and the MSCCL runtime (this repo). The MSCCL toolkit contains a high-level DSL (MSCCLang) and a compiler that generates an IR for the MSCCL runtime to run on the backend. MSCCL automatically falls back to NCCL's generic algorithm when no custom algorithm is available. The Example section shows how the MSCCL toolkit works with the runtime. Please refer to the MSCCL toolkit for more information.
Profiling: MSCCL has a profiling tool, NPKit, which provides a detailed timeline for each primitive send and receive operation to help understand the bottlenecks in a given collective communication algorithm.

MSCCL is the product of many great researchers and interns at Microsoft Research. Below is a list of our publications:

GC3: An Optimizing Compiler for GPU Collective Communication -- ASPLOS'23
Synthesizing optimal collective algorithms -- PPoPP'21 (Best Paper Award)
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads -- ASPLOS'22
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches -- NSDI'23

Please consider citing our work if you use MSCCL in your research. Also, please contact us if you have any questions or need an optimized collective communication algorithm for a specific topology.

Example

To use MSCCL's customized algorithms, you may follow these steps, which use two different MSCCL algorithms for AllReduce on Azure NDv4 (8xA100 GPUs):

Steps to install MSCCL:

$ git clone https://github.com/microsoft/msccl.git
$ cd msccl/
$ make -j src.build
$ cd ../

Then, follow these steps to install nccl-tests for performance evaluation:

$ git clone https://github.com/nvidia/nccl-tests.git
$ cd nccl-tests/
$ make MPI=1 NCCL_HOME=../msccl/build/ -j 
$ cd ../

Next, install the MSCCL toolkit to compile a few custom algorithms:

$ git clone https://github.com/microsoft/msccl-tools.git
$ cd msccl-tools/
$ pip install .
$ cd ../
$ python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml

The compiler's generated code is an XML file (test.xml) that is fed to MSCCL runtime. To evaluate its performance, execute the following command line on an Azure NDv4 node or any 8xA100 system:

$ mpirun -np 8 -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x MSCCL_XML_FILES=test.xml -x NCCL_ALGO=MSCCL,RING,TREE  nccl-tests/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0

If everything is installed correctly, you should see the following output in the log:

[0] NCCL INFO Connected 1 MSCCL algorithms
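When the run is scripted, the check for this log line can be automated. The following is a minimal sketch, assuming the log wording matches the line shown above exactly; the function name `count_msccl_algorithms` is hypothetical and not part of MSCCL.

```python
import re

# Pattern for the log line quoted in this README. The exact wording is an
# assumption and may differ between MSCCL versions.
_CONNECTED = re.compile(r"NCCL INFO Connected (\d+) MSCCL algorithms")


def count_msccl_algorithms(log_text: str) -> int:
    """Return the largest algorithm count reported in the log, or 0 if none."""
    counts = [int(m.group(1)) for m in _CONNECTED.finditer(log_text)]
    return max(counts, default=0)


if __name__ == "__main__":
    sample = "[0] NCCL INFO Connected 1 MSCCL algorithms\n"
    print(count_msccl_algorithms(sample))
```

A result of 0 would suggest that no custom algorithm was loaded and that the run used NCCL's generic algorithms.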

test.xml is passed to the runtime through MSCCL_XML_FILES on the command line. You may evaluate the performance of test.xml by comparing in-place (the new algorithm) against out-of-place (the default ring algorithm); it should be up to 2-3x faster on 8xA100 NVLink-interconnected GPUs. The MSCCL toolkit has a rich set of algorithms for different Azure SKUs and collective operations with significant speedups over vanilla NCCL.

Build

To build the library:

$ cd msccl
$ make -j src.build

If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with:

$ make src.build CUDA_HOME=<path to cuda install>

MSCCL will be compiled and installed in build/ unless BUILDDIR is set.

By default, MSCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform:

$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_80,code=sm_80"
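If several target architectures are needed, the NVCC_GENCODE value can be assembled programmatically. This is a small sketch that only reproduces the flag format shown in the command above; the helper name `nvcc_gencode` is hypothetical, and joining multiple flags with spaces is an assumption.

```python
def nvcc_gencode(compute_capabilities):
    """Build an NVCC_GENCODE string for the given compute capabilities.

    Each entry is a number such as 80 (A100). The flag format follows the
    single-architecture example in this README.
    """
    flags = [
        f"-gencode=arch=compute_{cc},code=sm_{cc}" for cc in compute_capabilities
    ]
    return " ".join(flags)


if __name__ == "__main__":
    # Prints the value to pass as NVCC_GENCODE="..." on the make command line.
    print(nvcc_gencode([80]))
```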

Install

To install MSCCL on the system, create a package then install it as root.

Debian/Ubuntu :

$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/

RedHat/CentOS :

$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/

OS-agnostic tarball :

$ make pkg.txz.build
$ ls build/pkg/txz/

PyTorch Integration

For integration with PyTorch, follow the dockerfile in this repo. It has an example for how to replace default NCCL with MSCCL.

NPKit Integration

MSCCL integrates NPKit, a profiler framework that collects fine-grained trace events in MSCCL components to identify transmission bottlenecks.

To enable NPKit, add NPKIT=1 to your make command. During execution, the environment variable NPKIT_DUMP_DIR specifies where the output is written (one output file per rank). By default, /tmp/ is used.

To analyze NPKit output, run the Python script tools/npkit_trace_generator.py to get the final .json file, which can be viewed in a trace viewer such as Microsoft Edge edge://tracing or Google Chrome chrome://tracing.

msccl's Issues

Schedule fails at large size

Hi, we have observed that the following allreduce schedule test.txt fails only at large size (16777216B) with branch msccl/merging_with_nccl. We observed this bug in our testbed when there is a large number of nodes (8+), with a large nchunksperloop, and at large sizes. The other attached schedule, test2.txt, is the same schedule but with a smaller nchunksperloop. test2.txt passes the datacheck.

test.txt
test2.txt

We modified the coll of test.txt to test with all_reduce_perf, reduce_scatter_perf, and all_gather_perf. The results are as follows:

`all_reduce_perf`

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        8192          2048     float     sum    39.32    0.21    0.36  2e-07
       16384          4096     float     sum    39.02    0.42    0.73  2e-07
       32768          8192     float     sum    39.24    0.84    1.46  2e-07
       65536         16384     float     sum    39.54    1.66    2.90  2e-07
      131072         32768     float     sum    41.66    3.15    5.51  5e-07
      262144         65536     float     sum    45.30    5.79   10.13  5e-07
      524288        131072     float     sum    53.41    9.82   17.18  5e-07
     1048576        262144     float     sum    71.83   14.60   25.55  5e-07
     2097152        524288     float     sum    112.6   18.63   32.61  5e-07
     4194304       1048576     float     sum    175.7   23.87   41.77  5e-07
     8388608       2097152     float     sum    314.7   26.66   46.65  5e-07
    16777216       4194304     float     sum    600.5   27.94   48.89  2e+00
# Out of bounds values : 2 FAILED
# Avg bus bandwidth    : 19.4778 
#

`reduce_scatter_perf`

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
        8192           256     float     sum    39.58    0.21    0.18  2e-07
       16384           512     float     sum    38.75    0.42    0.37  2e-07
       32768          1024     float     sum    39.41    0.83    0.73  2e-07
       65536          2048     float     sum    39.75    1.65    1.44  2e-07
      131072          4096     float     sum    41.78    3.14    2.74  5e-07
      262144          8192     float     sum    44.95    5.83    5.10  5e-07
      524288         16384     float     sum    53.11    9.87    8.64  5e-07
     1048576         32768     float     sum    72.23   14.52   12.70  5e-07
     2097152         65536     float     sum    112.1   18.71   16.37  5e-07
     4194304        131072     float     sum    201.5   20.81   18.21  5e-07
     8388608        262144     float     sum    330.6   25.37   22.20  5e-07
    16777216        524288     float     sum    602.1   27.87   24.38  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.42284 
#
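The busbw columns in the two tables above relate to algbw through the correction factors that nccl-tests uses: 2(n-1)/n for allreduce and (n-1)/n for reduce-scatter, where n is the number of ranks. The sketch below checks the last row of each table, assuming n = 8; that rank count is inferred from the ratio 48.89 / 27.94 ≈ 1.75 and is not stated in the report.

```python
def allreduce_busbw(algbw: float, nranks: int) -> float:
    """Bus bandwidth for allreduce: algbw * 2 * (n - 1) / n."""
    return algbw * 2 * (nranks - 1) / nranks


def reduce_scatter_busbw(algbw: float, nranks: int) -> float:
    """Bus bandwidth for reduce-scatter: algbw * (n - 1) / n."""
    return algbw * (nranks - 1) / nranks


if __name__ == "__main__":
    nranks = 8  # assumption inferred from the busbw/algbw ratios in the tables
    # Last row of the all_reduce_perf table: algbw 27.94 GB/s.
    print(round(allreduce_busbw(27.94, nranks), 2))
    # Last row of the reduce_scatter_perf table: algbw 27.87 GB/s.
    print(round(reduce_scatter_busbw(27.87, nranks), 2))
```

Both results agree with the reported busbw values (48.89 and 24.38 GB/s) to within rounding of the printed columns.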

`all_gather_perf`

g1lmd6: Test CUDA failure common.cu:506 'an illegal memory access was encountered'
 .. g1lmd6 pid 1269293: Test failure common.cu:607
 .. g1lmd6 pid 1269293: Test failure common.cu:798
 .. g1lmd6 pid 1269293: Test failure all_gather.cu:93
 .. g1lmd6 pid 1269293: Test failure common.cu:824
 .. g1lmd6 pid 1269293: Test failure common.cu:1199
 .. g1lmd6 pid 1269293: Test failure common.cu:1039

We also tried to change test.txt to Simple and LL128 protocols. LL128 is fine, but a divide-by-zero error is observed with Simple. test2.txt works perfectly with any protocol.

In summary, we observed the following bugs:

test.txt fails datacheck at large size.
test.txt cannot run with all_gather_perf.
test.txt has divide-by-zero error when changed to Simple protocol.
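To locate failing sizes without reading each table by eye, the error column can be scanned automatically. This is a rough sketch for the eight-column output format shown in this report (out-of-place columns only); the function name `failing_sizes` and the threshold of 1e-3 are illustrative assumptions and are not the tolerance nccl-tests applies internally.

```python
def failing_sizes(output: str, max_error: float = 1e-3):
    """Return the sizes (in bytes) whose error column exceeds max_error.

    Expects rows of the form: size count type redop time algbw busbw error.
    Header and summary lines (starting with '#') are skipped.
    """
    bad = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 8 or not fields[0].isdigit():
            continue
        size, error = int(fields[0]), float(fields[7])
        if error > max_error:
            bad.append(size)
    return bad


if __name__ == "__main__":
    sample = """
     8388608       2097152     float     sum    314.7   26.66   46.65  5e-07
    16777216       4194304     float     sum    600.5   27.94   48.89  2e+00
# Out of bounds values : 2 FAILED
"""
    print(failing_sizes(sample))
```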

Interesting dip in runtime

Hi, we have observed an interesting phenomenon with multi-channel/multi-instance schedules. When experimenting with the schedules at different sizes, we observed a short dip in runtime around 1MB as the size increases. The phenomenon has been observed across almost all schedules. We wonder if there is an explanation for this.

The following plot shows the phenomenon from 8-node ring allreduce schedules with Simple protocol:

The LL has a similar but less significant dip:

The above results are obtained from our testbed. We are also able to reproduce this dip on NVSwitch system:

AR_OptNaiveBiRing_2_8_n8_c8_tb16_cf1_Simple_v2.xml:

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         512           128     float     sum    169.8    0.00    0.01  1e-07
        1024           256     float     sum    170.3    0.01    0.01  2e-07
        2048           512     float     sum    165.9    0.01    0.02  2e-07
        4096          1024     float     sum    159.7    0.03    0.04  2e-07
        8192          2048     float     sum    154.1    0.05    0.09  2e-07
       16384          4096     float     sum    154.3    0.11    0.19  2e-07
       32768          8192     float     sum    153.7    0.21    0.37  2e-07
       65536         16384     float     sum    150.2    0.44    0.76  2e-07
      131072         32768     float     sum    144.6    0.91    1.59  2e-07
      262144         65536     float     sum    145.7    1.80    3.15  2e-07
      524288        131072     float     sum    153.4    3.42    5.98  2e-07
     1048576        262144     float     sum    155.5    6.74   11.80  2e-07
     2097152        524288     float     sum    161.4   13.00   22.74  2e-07
     4194304       1048576     float     sum    173.6   24.15   42.27  2e-07
     8388608       2097152     float     sum    193.3   43.40   75.95  2e-07
    16777216       4194304     float     sum    235.2   71.35  124.85  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 18.1146 
#

AR_OptNaiveBiRing_2_8_n8_c4_tb8_cf1_Simple_v2.xml:

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         256            64     float     sum    156.2    0.00    0.00  1e-07
         512           128     float     sum    156.1    0.00    0.01  1e-07
        1024           256     float     sum    153.7    0.01    0.01  2e-07
        2048           512     float     sum    147.4    0.01    0.02  2e-07
        4096          1024     float     sum    145.6    0.03    0.05  2e-07
        8192          2048     float     sum    135.9    0.06    0.11  2e-07
       16384          4096     float     sum    129.0    0.13    0.22  2e-07
       32768          8192     float     sum    127.8    0.26    0.45  2e-07
       65536         16384     float     sum    126.7    0.52    0.91  2e-07
      131072         32768     float     sum    126.7    1.03    1.81  2e-07
      262144         65536     float     sum    134.6    1.95    3.41  2e-07
      524288        131072     float     sum    136.6    3.84    6.72  2e-07
     1048576        262144     float     sum    141.5    7.41   12.97  2e-07
     2097152        524288     float     sum    154.6   13.57   23.74  2e-07
     4194304       1048576     float     sum    171.3   24.48   42.84  2e-07
     8388608       2097152     float     sum    224.2   37.41   65.47  2e-07
    16777216       4194304     float     sum    315.0   53.26   93.20  2e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 14.8194 
#

The schedules are attached:

1 channel: AR_OptNaiveBiRing_2_8_n8_c1_tb2_cf1_Simple_v2.txt
2 channel: AR_OptNaiveBiRing_2_8_n8_c2_tb4_cf1_Simple_v2.txt
4 channel: AR_OptNaiveBiRing_2_8_n8_c4_tb8_cf1_Simple_v2.txt
8 channel: AR_OptNaiveBiRing_2_8_n8_c8_tb16_cf1_Simple_v2.txt

Could you also reproduce this phenomenon with your multi-channel/multi-instance schedules?
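The dip can be quantified from the tables above by finding the size with the lowest runtime and comparing it with the runtime at the smallest size. The following sketch uses the first ten rows of the 4-channel NVSwitch table; the helper names are hypothetical.

```python
def fastest_size(rows):
    """rows: list of (size_bytes, time_us). Return the pair with the lowest time.

    If several sizes tie, the first one in the list is returned.
    """
    return min(rows, key=lambda row: row[1])


def dip_depth(rows):
    """Return (size, relative drop) of the fastest size versus the first row."""
    first_time = rows[0][1]
    size, best_time = fastest_size(rows)
    return size, (first_time - best_time) / first_time


if __name__ == "__main__":
    # (size in bytes, out-of-place time in us) from the c4_tb8 table above.
    rows = [
        (256, 156.2), (512, 156.1), (1024, 153.7), (2048, 147.4),
        (4096, 145.6), (8192, 135.9), (16384, 129.0), (32768, 127.8),
        (65536, 126.7), (131072, 126.7),
    ]
    size, depth = dip_depth(rows)
    print(size, round(depth, 3))
```

For this table the runtime at the fastest size is roughly 19% lower than at 256 B, even though the message is several hundred times larger.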

How to use test.xml algo.?

Hi, I have successfully set up msccl-tools and can generate the .xml algorithm files from the examples as described here, using either

python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 8 2 > test.xml

or

python msccl-tools/examples/mscclang/allreduce_a100_ring.py --protocol=Simple 8 1 1 > test.xml

and copying test.xml to /msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms. My quick question is how to enforce the usage of test.xml in the following command. Another question is how to use the synthesized .json files here.

mpirun -np 1 -x LD_LIBRARY_PATH=msccl/exector/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_ALGO=TREE -x NCCL_DEBUG_SUBSYS=INIT,ENV msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 32MB -f 2 -g 8 -c 1 -n 100 -w 100 -G 100 -z 0

It seems to ignore the file and fall back to NCCL. Thanks in advance for your help. The following is the topology I used, with 8xA100 GPUs on 1 node:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0     X      NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-31            0
GPU1    NV12     X      NV12    NV12    NV12    NV12    NV12    NV12    0-31            0
GPU2    NV12    NV12     X      NV12    NV12    NV12    NV12    NV12    0-31            0
GPU3    NV12    NV12    NV12     X      NV12    NV12    NV12    NV12    0-31            0
GPU4    NV12    NV12    NV12    NV12     X      NV12    NV12    NV12    32-63           1
GPU5    NV12    NV12    NV12    NV12    NV12     X      NV12    NV12    32-63           1
GPU6    NV12    NV12    NV12    NV12    NV12    NV12     X      NV12    32-63           1
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12     X      32-63           1

Legend:

 X    = Self
 SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
 NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
 PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
 PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
 PIX  = Connection traversing at most a single PCIe bridge
 NV#  = Connection traversing a bonded set of # NVLinks

The issue is that no matter which test.xml I use, I get the same in-place and out-of-place results:

#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         128            32     float     sum      -1    19.96    0.01    0.01      0    17.59    0.01    0.01      0
         256            64     float     sum      -1    17.75    0.01    0.03      0    17.78    0.01    0.03      0
         512           128     float     sum      -1    18.58    0.03    0.05      0    18.61    0.03    0.05      0
        1024           256     float     sum      -1    18.88    0.05    0.09      0    18.86    0.05    0.10      0
        2048           512     float     sum      -1    19.63    0.10    0.18      0    19.57    0.10    0.18      0
        4096          1024     float     sum      -1    21.03    0.19    0.34      0    21.01    0.19    0.34      0
        8192          2048     float     sum      -1    21.98    0.37    0.65      0    22.02    0.37    0.65      0
       16384          4096     float     sum      -1    22.44    0.73    1.28      0    22.48    0.73    1.28      0
       32768          8192     float     sum      -1    23.17    1.41    2.47      0    23.13    1.42    2.48      0
       65536         16384     float     sum      -1    24.22    2.71    4.74      0    24.14    2.72    4.75      0
      131072         32768     float     sum      -1    28.36    4.62    8.09      0    27.79    4.72    8.25      0
      262144         65536     float     sum      -1    44.06    5.95   10.41      0    43.96    5.96   10.44      0
      524288        131072     float     sum      -1    51.40   10.20   17.85      0    51.30   10.22   17.88      0
     1048576        262144     float     sum      -1    60.20   17.42   30.48      0    60.16   17.43   30.50      0
     2097152        524288     float     sum      -1    73.98   28.35   49.61      0    73.65   28.48   49.83      0
     4194304       1048576     float     sum      -1    100.2   41.86   73.25      0    99.04   42.35   74.11      0
     8388608       2097152     float     sum      -1    151.0   55.56   97.23      0    148.8   56.38   98.66      0
    16777216       4194304     float     sum      -1    254.9   65.81  115.16      0    254.6   65.89  115.31      0
    33554432       8388608     float     sum      -1    464.5   72.24  126.43      0    464.7   72.20  126.36      0

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 28.4095
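Since the README describes in-place as the new algorithm and out-of-place as the default one, the ratio of the two time columns gives a quick signal of whether a custom algorithm was picked up. The sketch below parses the 13-column row format shown in the table above; the function name `inplace_speedups` is hypothetical.

```python
def inplace_speedups(output: str):
    """Map message size to out-of-place time divided by in-place time.

    Expects rows of the form:
    size count type redop root time algbw busbw #wrong time algbw busbw #wrong
    Lines that do not match (headers, blank lines, summaries) are skipped.
    """
    speedups = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 13 or not fields[0].isdigit():
            continue
        size = int(fields[0])
        out_of_place_us = float(fields[5])
        in_place_us = float(fields[9])
        speedups[size] = out_of_place_us / in_place_us
    return speedups


if __name__ == "__main__":
    sample = (
        "    33554432       8388608     float     sum      -1"
        "    464.5   72.24  126.43      0    464.7   72.20  126.36      0"
    )
    for size, ratio in inplace_speedups(sample).items():
        print(size, round(ratio, 3))
```

A ratio close to 1.0 at every size, as in this table, is consistent with both columns running the same algorithm.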

I have double-checked: even if we remove test.xml, the INFO log still prints NCCL INFO NCCL_ALGO set by environment to msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/test.xml.

Multi-gpu NCCL test test_all_reduce_coalesced_nccl failing

While running the multi-GPU PyTorch tests, test_all_reduce_coalesced_nccl fails in pytorch/test/test_c10d_nccl.py. The error appears to be caused by inconsistent results from allreduce. The log output is as follows:
171495ffc000000:237471:237471 [0] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237471:237471 [0] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237471:237471 [0] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237471:237471 [0] NCCL INFO Using network Socket
NCCL version 2.12.12.MSCCL.0.1+cuda11.3
171495ffc000000:237471:237535 [0] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x203400000
171495ffc000000:237472:237472 [1] NCCL INFO Bootstrap : Using eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
171495ffc000000:237472:237472 [1] NCCL INFO Failed to open libibverbs.so[.1]
171495ffc000000:237472:237472 [1] NCCL INFO NET/Socket : Using [0]eth0:10.1.0.4<0>
171495ffc000000:237472:237472 [1] NCCL INFO Using network Socket
171495ffc000000:237472:237536 [1] NCCL INFO init.cc:233 Cuda Host Alloc Size 4 pointer 0x206800000
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0001-0000-3130-444531334632/pci0001:00/0001:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_speed, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection : could not read /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/47505500-0002-0000-3130-444531334632/pci0002:00/0002:00:00.0/../max_link_width, ignoring
171495ffc000000:237471:237535 [0] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237472:237536 [1] NCCL INFO Topology detection: network path /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/000d3a1f-1dcf-000d-3a1f-1dcf000d3a1f is not a PCI device (vmbus). Attaching to first CPU
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237472:237536 [1] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO KV Convert to int : could not find value of '' in dictionary, falling back to 60
171495ffc000000:237471:237535 [0] NCCL INFO === System : maxWidth 12.0 totalWidth 12.0 ===

Additional error info:
ERROR:torch.testing._internal.common_distributed:Caught exception:
Traceback (most recent call last):
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 601, in run_test
getattr(self, test_name)()
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 486, in wrapper
fn()
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 3098, in wrapper
return func(*args, **kwargs)
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_distributed.py", line 131, in wrapper
return func(*args, **kwargs)
File "/mnt/vss/_work/1/s/test/pytorch/test/distributed/test_c10d_nccl.py", line 2867, in test_all_reduce_coalesced_nccl
self.assertEqual(t, torch.full_like(t, self.world_size * (i + (self.world_size + 1.) / 2.)))
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_internal/common_utils.py", line 2121, in assertEqual
assert_equal(
File "/mnt/vss/_work/1/s/test/pytorch/torch/testing/_comparison.py", line 1080, in assert_equal
raise error_metas[0].to_error()
AssertionError: Tensor-likes are not close!
Mismatched elements: 60 / 60 (100.0%)
Greatest absolute difference: 1.0 at index 0 (up to 1e-05 allowed)
Greatest relative difference: 0.3333333333333333 at index 0 (up to 1.3e-06 allowed)

exiting process 1 with exit code: 10
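The expected value in the failing assertion can be worked out by hand. The sketch below evaluates the formula from the traceback and compares it with a direct sum over ranks, assuming that rank r fills its i-th tensor with r + 1 + i; that input pattern is an assumption that would produce this formula and is not shown in the log.

```python
def expected_value(world_size: int, i: int) -> float:
    """Formula from the assertion in the traceback."""
    return world_size * (i + (world_size + 1.0) / 2.0)


def sum_of_rank_inputs(world_size: int, i: int) -> int:
    """Sum of per-rank inputs, assuming rank r contributes r + 1 + i."""
    return sum(rank + 1 + i for rank in range(world_size))


if __name__ == "__main__":
    world_size = 2  # two ranks, [0] and [1], appear in the log above
    for i in range(3):
        print(i, expected_value(world_size, i), sum_of_rank_inputs(world_size, i))
```

With two ranks and i = 0 the expected value is 3.0. The reported greatest absolute difference of 1.0 and relative difference of 0.333 (1.0 / 3.0) fit that case if the relative difference is measured against the expected value, which would mean the observed result was 2.0 or 4.0 rather than 3.0.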

Datacheck fails only with `LL` protocol

Hi, the attached 6-node allreduce schedule fails the datacheck with the LL protocol at small sizes. However, the datacheck passes when the protocol is Simple or LL128.

schedule.txt

LL Output (datacheck fails at the first three sizes):

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         192            48     float     sum    29.73    0.01    0.01  9e-02
         384            96     float     sum    28.63    0.01    0.02  5e-02
         768           192     float     sum    28.14    0.03    0.05  2e-02
        1536           384     float     sum    24.35    0.06    0.11  2e-07
        3072           768     float     sum    24.96    0.12    0.21  2e-07
        6144          1536     float     sum    27.63    0.22    0.37  2e-07
       12288          3072     float     sum    28.13    0.44    0.73  5e-07
       24576          6144     float     sum    29.04    0.85    1.41  5e-07
       49152         12288     float     sum    30.66    1.60    2.67  5e-07
       98304         24576     float     sum    33.85    2.90    4.84  5e-07
      196608         49152     float     sum    41.39    4.75    7.92  5e-07
      393216         98304     float     sum    59.40    6.62   11.03  5e-07
      786432        196608     float     sum    92.67    8.49   14.14  5e-07
     1572864        393216     float     sum    160.1    9.82   16.37  5e-07
     3145728        786432     float     sum    307.9   10.22   17.03  5e-07
     6291456       1572864     float     sum    603.1   10.43   17.39  5e-07
# Out of bounds values : 18 FAILED
# Avg bus bandwidth    : 5.89321 
#

Simple Output:

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         192            48     float     sum    68.20    0.00    0.00  1e-07
         384            96     float     sum    68.03    0.01    0.01  1e-07
         768           192     float     sum    67.89    0.01    0.02  6e-08
        1536           384     float     sum    67.81    0.02    0.04  2e-07
        3072           768     float     sum    67.59    0.05    0.08  2e-07
        6144          1536     float     sum    64.17    0.10    0.16  2e-07
       12288          3072     float     sum    66.96    0.18    0.31  5e-07
       24576          6144     float     sum    69.75    0.35    0.59  5e-07
       49152         12288     float     sum    79.26    0.62    1.03  5e-07
       98304         24576     float     sum    79.49    1.24    2.06  5e-07
      196608         49152     float     sum    80.89    2.43    4.05  5e-07
      393216         98304     float     sum    87.15    4.51    7.52  5e-07
      786432        196608     float     sum    107.8    7.30   12.16  5e-07
     1572864        393216     float     sum    152.9   10.29   17.15  5e-07
     3145728        786432     float     sum    230.5   13.65   22.75  5e-07
     6291456       1572864     float     sum    389.0   16.17   26.96  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 5.93026 
#

LL128 Output:

#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
         192            48     float     sum    54.57    0.00    0.01  1e-07
         384            96     float     sum    54.44    0.01    0.01  1e-07
         768           192     float     sum    54.44    0.01    0.02  6e-08
        1536           384     float     sum    54.54    0.03    0.05  2e-07
        3072           768     float     sum    54.48    0.06    0.09  2e-07
        6144          1536     float     sum    72.97    0.08    0.14  2e-07
       12288          3072     float     sum    72.51    0.17    0.28  5e-07
       24576          6144     float     sum    73.21    0.34    0.56  5e-07
       49152         12288     float     sum    74.69    0.66    1.10  5e-07
       98304         24576     float     sum    77.22    1.27    2.12  5e-07
      196608         49152     float     sum    82.67    2.38    3.96  5e-07
      393216         98304     float     sum    92.37    4.26    7.10  5e-07
      786432        196608     float     sum    99.20    7.93   13.21  5e-07
     1572864        393216     float     sum    137.2   11.47   19.11  5e-07
     3145728        786432     float     sum    216.0   14.57   24.28  5e-07
     6291456       1572864     float     sum    372.2   16.90   28.17  5e-07
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 6.26362 
#

Meanwhile, when changing the coll to allgather or reduce_scatter, NCCL fails to start with the following error when running all_gather_perf and reduce_scatter_perf.

g1lmd6:2758544:2758544 [3] enqueue.cc:346 NCCL WARN Error : no algorithm/protocol available
g1lmd6: Test NCCL failure reduce_scatter.cu:58 'internal error'
 .. g1lmd6 pid 2758544: Test failure common.cu:583
 .. g1lmd6 pid 2758544: Test failure common.cu:796
 .. g1lmd6 pid 2758544: Test failure reduce_scatter.cu:104
 .. g1lmd6 pid 2758544: Test failure common.cu:824
 .. g1lmd6 pid 2758544: Test failure common.cu:1199
 .. g1lmd6 pid 2758544: Test failure common.cu:1039

Grammar error in README.md

Please correct this to avoid confusion.
https://github.com/microsoft/msccl#:~:text=MSCCL%20will%20always%20MSCCL%20will%20automatically%20fall%20back%20to%20a%20NCCL%27s%20generic%20algorithm%20in%20case%20there%20is%20no%20custom%20algorithm.

This repo is missing important files

There are important files that Microsoft projects should all have that are not present in this repository. A pull request has been opened to add the missing file(s). When the pr is merged this issue will be closed automatically.

Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

Merge this pull request

A Doubt about the Source Code of MSCCL

I am currently studying the MSCCL source code and have a question about the count field in this structure. What does it mean?

// TODO: compress this by a lot!
struct mscclTransfer {
  int16_t srcoffset;
  int16_t dstoffset;
  uint8_t srcbuffer; // follow MSCCL_THIS_INPUT/MSCCL_THIS_OUTPUT macros
  uint8_t dstbuffer; // follow MSCCL_THIS_INPUT/MSCCL_THIS_OUTPUT macros
  int16_t depencePointer; // index to the first dependence
  int16_t numDependences; // depencePointer+numDependences indicate the last dependence
  int8_t has_dependence;
  int16_t numReductions; // number of reductions with the same dst
  int16_t reductionPointer; // where the reduction starts
  uint8_t type;
  //--------------//
  uint8_t count; //
  //--------------//
};
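
One way to get a feel for the fields is to mirror the declaration and look at the sizes and value ranges its integer widths allow. The sketch below is an illustration only: it reproduces the struct with Python's `ctypes` under the platform's native alignment rules, and it does not say what the runtime stores in `count`. The class name `MscclTransfer` is made up for this example.

```python
import ctypes


class MscclTransfer(ctypes.Structure):
    # Field order and widths copied from the struct above (hypothetical mirror).
    _fields_ = [
        ("srcoffset", ctypes.c_int16),
        ("dstoffset", ctypes.c_int16),
        ("srcbuffer", ctypes.c_uint8),
        ("dstbuffer", ctypes.c_uint8),
        ("depencePointer", ctypes.c_int16),
        ("numDependences", ctypes.c_int16),
        ("has_dependence", ctypes.c_int8),
        ("numReductions", ctypes.c_int16),
        ("reductionPointer", ctypes.c_int16),
        ("type", ctypes.c_uint8),
        ("count", ctypes.c_uint8),
    ]


# Total size including any alignment padding the compiler would insert.
struct_size = ctypes.sizeof(MscclTransfer)
# Sum of the field sizes alone, without padding.
payload_size = sum(ctypes.sizeof(ctype) for _, ctype in MscclTransfer._fields_)
# Largest value an unsigned 8-bit field such as `count` can hold.
count_max = 2 ** (8 * ctypes.sizeof(ctypes.c_uint8)) - 1

print(struct_size, payload_size, count_max)
```

Whatever `count` represents, its `uint8_t` type limits it to values from 0 to 255, and the `int16_t` offsets are limited to 32767.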

Schedule fails datacheck in `origin/reduction_in_prims`

Hi, the following reduce-scatter schedule works on master but fails the data check on the reduction_in_prims branch.

<algo name="test" proto="Simple" nchannels="1" nchunksperloop="4" ngpus="2" coll="reduce_scatter" inplace="1">
  <gpu id="0" i_chunks="4" o_chunks="0" s_chunks="2">
    <tb id="0" send="1" recv="1" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="2" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="1" i_chunks="4" o_chunks="0" s_chunks="2">
    <tb id="0" send="0" recv="0" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="2" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="2" cnt="2" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
</algo>
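
As a sanity check on the schedule itself, its data movement can be replayed on symbolic chunks. The sketch below is a simplification under stated assumptions, not the runtime's behavior: it assumes a `s` step copies `cnt` chunks from the sender's `srcbuf` into the `dstbuf` of the peer named by the thread block's `send` attribute, that a `re` step reduces `srcbuf` into `dstbuf` locally, and that all sends finish before any reduction. The function name `simulate` is made up for this example.

```python
import xml.etree.ElementTree as ET

SCHEDULE = """
<algo name="test" proto="Simple" nchannels="1" nchunksperloop="4" ngpus="2" coll="reduce_scatter" inplace="1">
  <gpu id="0" i_chunks="4" o_chunks="0" s_chunks="2">
    <tb id="0" send="1" recv="1" chan="0">
      <step s="0" type="s" srcbuf="i" srcoff="2" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="1" i_chunks="4" o_chunks="0" s_chunks="2">
    <tb id="0" send="0" recv="0" chan="0">
      <step s="0" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="2" dstbuf="s" dstoff="0" cnt="2" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="2" cnt="2" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
</algo>
"""


def simulate(xml_text):
    """Replay all sends, then all local reductions, on symbolic chunks."""
    algo = ET.fromstring(xml_text.strip())
    bufs = {}
    for gpu in algo.findall("gpu"):
        rank = int(gpu.get("id"))
        bufs[rank] = {
            # Input chunk c on rank r starts as the single contribution (r, c).
            "i": [{(rank, c)} for c in range(int(gpu.get("i_chunks")))],
            "s": [set() for _ in range(int(gpu.get("s_chunks")))],
        }
    for wanted in ("s", "re"):  # assumed ordering: sends first, reductions second
        for gpu in algo.findall("gpu"):
            rank = int(gpu.get("id"))
            for tb in gpu.findall("tb"):
                for step in tb.findall("step"):
                    if step.get("type") != wanted:
                        continue
                    src = bufs[rank][step.get("srcbuf")]
                    # A send writes into the peer's buffer; a reduction stays local.
                    owner = int(tb.get("send")) if wanted == "s" else rank
                    dst = bufs[owner][step.get("dstbuf")]
                    soff, doff = int(step.get("srcoff")), int(step.get("dstoff"))
                    for k in range(int(step.get("cnt"))):
                        if wanted == "s":
                            dst[doff + k] = set(src[soff + k])
                        else:
                            dst[doff + k] = dst[doff + k] | src[soff + k]
    return bufs


result = simulate(SCHEDULE)
print(result[0]["i"][:2], result[1]["i"][2:])
```

Under these assumptions, rank 0 ends with both ranks' contributions in input chunks 0 and 1, and rank 1 ends with both in chunks 2 and 3, which are the chunks their `re` steps write.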

mpirun -np 2 -x NCCL_NET_SHARED_BUFFERS=0  -x MSCCL_XML_FILES=schedule/schedule.xml -x NCCL_ALGO=MSCCL -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x NCCL_DEBUG=WARN -x NCCL_DEBUG_SUBSYS=ALL -x LD_LIBRARY_PATH=~/msccl/build/lib/:$LD_LIBRARY_PATH ~/nccl-tests/build/reduce_scatter_perf -b 16 -e 1048576 -f 2 -g 1 -c 1 -n 200 -w 200 -z 0
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           g1lmd6
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
# nThread 1 nGpus 1 minBytes 16 maxBytes 1048576 step: 2(factor) warmup iters: 200 iters: 200 validation: 1 
#
# Using devices
#   Rank  0 Pid 522477 on     g1lmd6 device  0 [0x07] NVIDIA A100-SXM4-40GB
#   Rank  1 Pid 522478 on     g1lmd6 device  1 [0x0a] NVIDIA A100-SXM4-40GB
NCCL version 2.12.12.MSCCL.0.1+cuda11.4
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
          16             2     float     sum     0.63    0.03    0.01  3e-01
          32             4     float     sum     0.61    0.05    0.03  3e-01
          64             8     float     sum     0.62    0.10    0.05  2e-01
         128            16     float     sum     0.60    0.21    0.11  2e-01
         256            32     float     sum     0.62    0.41    0.21  1e-01
         512            64     float     sum     0.63    0.82    0.41  1e-01
        1024           128     float     sum     0.61    1.68    0.84  1e+00
        2048           256     float     sum     0.61    3.37    1.68  1e+00
        4096           512     float     sum     0.61    6.72    3.36  1e+00
        8192          1024     float     sum     0.61   13.47    6.73  1e+00
       16384          2048     float     sum     0.62   26.61   13.31  1e+00
       32768          4096     float     sum     0.62   53.10   26.55  1e+00
       65536          8192     float     sum     0.62  106.25   53.12  1e+00
      131072         16384     float     sum     0.62  212.12  106.06  1e+00
      262144         32768     float     sum     0.61  427.40  213.70  1e+00
      524288         65536     float     sum     0.61  858.05  429.02  1e+00
     1048576        131072     float     sum     0.61  1722.80  861.40  1e+00
# Out of bounds values : 34 FAILED
# Avg bus bandwidth    : 100.976 
#
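
The bandwidth columns above can be recomputed from the size and time columns. The sketch below assumes the convention described in the nccl-tests performance notes: algbw is bytes divided by elapsed time, and for reduce_scatter busbw is algbw scaled by (n-1)/n for n ranks. The function names are made up for this illustration.

```python
def algbw_gbps(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s: bytes divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def reduce_scatter_busbw_gbps(size_bytes: int, time_us: float, nranks: int) -> float:
    """Bus bandwidth in GB/s, assuming the (n-1)/n factor for reduce_scatter."""
    return algbw_gbps(size_bytes, time_us) * (nranks - 1) / nranks


# (size in bytes, time in us) copied from three rows of the table above; 2 ranks.
rows = [(16, 0.63), (1024, 0.61), (1048576, 0.61)]
for size, time_us in rows:
    print(size,
          round(algbw_gbps(size, time_us), 2),
          round(reduce_scatter_busbw_gbps(size, time_us, 2), 2))
```

The recomputed values differ slightly from the table because the table's time column is rounded. Since the time column stays near 0.6 us at every size, the computed bandwidth grows linearly with the message size.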

`re` step runs significantly slower compared against an earlier version

Hi, I have tested the following reduce-scatter schedule and observed a significantly larger runtime at large sizes compared against an earlier commit.

schedule.txt

Runtimes at size 134217728 bytes:

cbe70894b4da45fca99539269c857a6afb0a21a3 (origin/master): 16967us
e52c5259b6a9ee34b9c8bb0ff8f8d18b4b6cbe00: 2916.2us

Both runs succeed in datacheck.

After some debugging, it seems that the re step has run significantly slower since commit 179f5f995807e870c3dc99aba76502bf737e5d86. In particular, I tried the following schedule at several commits:

<algo name="test" proto="Simple" nchannels="1" nchunksperloop="8" ngpus="2" coll="allreduce" inplace="1">
  <gpu id="0" i_chunks="16" o_chunks="0" s_chunks="0">
    <tb id="0" send="-1" recv="-1" chan="0">
      <step s="0" type="re" srcbuf="i" srcoff="0" dstbuf="i" dstoff="8" cnt="8" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="1" i_chunks="8" o_chunks="0" s_chunks="0">
  </gpu>
</algo>

Runtimes at size 150994944 bytes:

cbe70894b4da45fca99539269c857a6afb0a21a3 (origin/master): 17677us
179f5f995807e870c3dc99aba76502bf737e5d86: 29189us
35a1d4b8da248ce97c8924745d897aae5c79eecc (tag: v0.6.2): 2624.8us
e52c5259b6a9ee34b9c8bb0ff8f8d18b4b6cbe00: 2621.7us
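
To put the four measurements on one scale, the short sketch below expresses each runtime as a slowdown relative to the v0.6.2 tag and as bytes of the reported size processed per second. The commit labels are abbreviated hashes from the list above, and the variable names are made up for this illustration.

```python
SIZE_BYTES = 150994944

# Runtimes in microseconds, copied from the list above.
runtimes_us = {
    "cbe70894 (origin/master)": 17677.0,
    "179f5f99": 29189.0,
    "35a1d4b8 (v0.6.2)": 2624.8,
    "e52c5259": 2621.7,
}

baseline_us = runtimes_us["35a1d4b8 (v0.6.2)"]

# Slowdown factor of each commit relative to the v0.6.2 measurement.
slowdowns = {commit: t / baseline_us for commit, t in runtimes_us.items()}
# Reported size divided by runtime, in GB/s.
rates_gbps = {commit: SIZE_BYTES / (t * 1e-6) / 1e9 for commit, t in runtimes_us.items()}

for commit in runtimes_us:
    print(f"{commit}: {slowdowns[commit]:.2f}x, {rates_gbps[commit]:.2f} GB/s")
```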

Questions about MSCCL's building error

Hello MSCCL team,

Thanks for the excellent work.
I have some issues when building MSCCL in my own environment.

I am currently using an Ubuntu 18.04 machine with GPUs connected via PCIe, not NVLink.
I tried to build MSCCL on two machines: the first has 2 x V100 32GB GPUs, and the second has 2 x A5000 GPUs.
Both builds use CUDA 11.1, which is set as the default CUDA path.

However, when I tried to build MSCCL following the guidelines in the official repo, the build froze with lots of warnings and then failed. (I tried building both from the source zip file and from a clone of the git repository, and neither succeeded.)

I tried to solve it by following the earlier build-error issues, but they do not seem to apply to my situation.
I am also wondering whether MSCCL is compatible only with NVLink systems, but I am not sure about it.

I've attached some error logs (the errors I got when building via source zip file & cloning the git repo).
Can I get some advice on my issue?

error_msccl_git_build.log
error_msccl_source_build.log

Is msccl compatible with cuda 11.3?

I was building torch 1.10.1 + msccl + cuda 11.3, but got a build error:

[6383/6761] Linking CXX executable bin/utility_ops_gpu_test
FAILED: bin/utility_ops_gpu_test 
: && /usr/bin/c++ -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic caffe2/CMakeFiles/utility_ops_gpu_test.dir/operators/utility_ops_gpu_test.cc.o -o bin/utility_ops_gpu_test  -Wl,-rpath,/home/cloudtest/.conda/envs/torch-test/lib:/usr/local/cuda/lib64:/mnt/vss/_work/1/s/test/pytorch/build/lib:  /usr/local/cuda/lib64/libcudart.so  lib/libgtest_main.a  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cuda_cpp.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /home/cloudtest/.conda/envs/torch-test/lib/libmkl_intel_lp64.so  /home/cloudtest/.conda/envs/torch-test/lib/libmkl_gnu_thread.so  /home/cloudtest/.conda/envs/torch-test/lib/libmkl_core.so  -fopenmp  /usr/lib/x86_64-linux-gnu/libpthread.so  -lm  /usr/lib/x86_64-linux-gnu/libdl.so  lib/libdnnl.a  -ldl  lib/libc10_cuda.so  
lib/libc10.so  /usr/local/cuda/lib64/libcudart.so  /home/cloudtest/.conda/envs/torch-test/lib/libnvToolsExt.so  /usr/local/cuda/lib64/libcufft.so  /usr/local/cuda/lib64/libcurand.so  /usr/local/cuda/lib64/libcublas.so  /usr/local/cuda/lib64/libcudnn.so  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cuda_cu.so" -Wl,--as-needed  lib/libgtest.a  -pthread && :
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
collect2: error: ld returned 1 exit status
[6384/6761] Linking CXX executable bin/roi_align_op_gpu_test
FAILED: bin/roi_align_op_gpu_test 
: && /usr/bin/c++ -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -rdynamic caffe2/CMakeFiles/roi_align_op_gpu_test.dir/operators/roi_align_op_gpu_test.cc.o -o bin/roi_align_op_gpu_test  -Wl,-rpath,/home/cloudtest/.conda/envs/torch-test/lib:/usr/local/cuda/lib64:/mnt/vss/_work/1/s/test/pytorch/build/lib:  /usr/local/cuda/lib64/libcudart.so  lib/libgtest_main.a  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cuda.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cuda_cpp.so" -Wl,--as-needed  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cpu.so" -Wl,--as-needed  lib/libprotobuf.a  /home/cloudtest/.conda/envs/torch-test/lib/libmkl_intel_lp64.so  /home/cloudtest/.conda/envs/torch-test/lib/libmkl_gnu_thread.so  /home/cloudtest/.conda/envs/torch-test/lib/libmkl_core.so  -fopenmp  /usr/lib/x86_64-linux-gnu/libpthread.so  -lm  /usr/lib/x86_64-linux-gnu/libdl.so  lib/libdnnl.a  -ldl  lib/libc10_cuda.so  
lib/libc10.so  /usr/local/cuda/lib64/libcudart.so  /home/cloudtest/.conda/envs/torch-test/lib/libnvToolsExt.so  /usr/local/cuda/lib64/libcufft.so  /usr/local/cuda/lib64/libcurand.so  /usr/local/cuda/lib64/libcublas.so  /usr/local/cuda/lib64/libcudnn.so  -Wl,--no-as-needed,"/mnt/vss/_work/1/s/test/pytorch/build/lib/libtorch_cuda_cu.so" -Wl,--as-needed  lib/libgtest.a  -pthread && :
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
/usr/bin/ld: /usr/local/cuda/lib64/libcublas.so: undefined reference to `[email protected]'
collect2: error: ld returned 1 exit status

However, (torch 1.10.1 + msccl + cuda 11.1) and (torch 1.10.1 + nccl + cuda 11.3) both work, so I was wondering whether msccl is compatible with cuda 11.3.

Performance of msccl did not improve compared with nccl

Hi, I just ran nccl-tests on 2 nodes, each with 8 NVLink-connected A100s, using the test.xml generated by `python msccl-tools/examples/mscclang/allreduce_a100_allpairs.py --protocol=LL 16 4 > test.xml`. I set NCCL_DEBUG to INFO and confirmed from the log line 'NCCL INFO Connected 1 MSCCL algorithms' that NCCL had picked up the MSCCL algorithm. I got this nccl-tests result:

I found in-place results became much slower than out-of-place results when the data size was larger. Could you help me with that?

chained reductions

Hi folks,

There may be a bug in how msccl currently handles chained reductions. I am providing some examples below that cause trouble at the moment. It would be great if you could shed some light on how msccl deals with these scenarios or what the recommended way to do these reductions is.

scenario 1

If there are several contiguous reductions into the same dst buffer and offset, and these reductions have different dependencies, then we have to place nop steps before the first reduction to capture all the dependencies; only the first reduction can carry a dependency, not the later ones.

So this FAILs the check

      <step s="12" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="0" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="0" cnt="1" depid="1" deps="7" hasdep="0"/>      
      <step s="13" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="0" cnt="1" depid="3" deps="7" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
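
The constraint described in scenario 1 can be checked mechanically before running a schedule. The sketch below is a hypothetical lint, not part of msccl: it walks the steps of one thread block in document order and reports every `re` step that carries a dependency (`depid` other than -1) while continuing a run of reductions into the same `dstbuf` and `dstoff`. The function name `late_dependencies` and the `<tb>` wrapper are made up for this example.

```python
import xml.etree.ElementTree as ET

# The four failing steps from scenario 1, wrapped in a <tb> element for parsing.
TB_XML = """
<tb>
  <step s="12" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="0" cnt="1" depid="2" deps="4" hasdep="0"/>
  <step s="11" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="0" cnt="1" depid="1" deps="7" hasdep="0"/>
  <step s="13" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="0" cnt="1" depid="3" deps="7" hasdep="0"/>
  <step s="10" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
</tb>
"""


def late_dependencies(tb):
    """Return the `s` ids of reductions that have a dependency but are not first in their run."""
    flagged = []
    prev_dst = None  # destination of the previous step if it was a reduction
    for step in tb.findall("step"):
        if step.get("type") != "re":
            prev_dst = None
            continue
        dst = (step.get("dstbuf"), step.get("dstoff"))
        if dst == prev_dst and step.get("depid") != "-1":
            flagged.append(int(step.get("s")))
        prev_dst = dst
    return flagged


flagged = late_dependencies(ET.fromstring(TB_XML.strip()))
print(flagged)
```

For the four steps above this prints `[11, 13]`: the first reduction (s="12") may keep its dependency, while the two later ones would need their dependencies moved to nop steps placed before the run.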

scenario 2

If the reductions are contiguous but not into the same offset (they go into different dst buffer offsets), we also see the NCCL check FAIL

<step s="4" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="10" cnt="1" depid="2" deps="3" hasdep="0"/>
<step s="5" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="11" cnt="1" depid="1" deps="3" hasdep="0"/>

A solution that seems to fix this is to add a dummy send step in between like the following:

<step s="4" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="10" cnt="1" depid="2" deps="3" hasdep="0"/>
<step s="5" type="s" srcbuf="i" srcoff="-1" dstbuf="o" dstoff="-1" cnt="0" depid="-1" deps="-1" hasdep="0"/>
<step s="6" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="11" cnt="1" depid="1" deps="3" hasdep="0"/>
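
The workaround above can be applied mechanically. The sketch below is a hypothetical helper, not part of msccl or its toolkit: it copies one thread block, inserts the same dummy send step between any two adjacent `re` steps whose destinations differ, and renumbers `s` from the first step's id. It also returns an old-to-new id map, because `deps` attributes in other thread blocks that point at renumbered steps would have to be rewritten separately; the sketch does not do that.

```python
import xml.etree.ElementTree as ET

# Attributes of the dummy send step, copied from the workaround above.
DUMMY = {"type": "s", "srcbuf": "i", "srcoff": "-1", "dstbuf": "o",
         "dstoff": "-1", "cnt": "0", "depid": "-1", "deps": "-1", "hasdep": "0"}

FRAGMENT = """
<tb>
  <step s="4" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="10" cnt="1" depid="2" deps="3" hasdep="0"/>
  <step s="5" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="11" cnt="1" depid="1" deps="3" hasdep="0"/>
</tb>
"""


def split_adjacent_reductions(tb):
    """Return (new_tb, remap): a copy of tb with dummy steps inserted, and old->new step ids."""
    steps = tb.findall("step")
    new_tb = ET.Element(tb.tag, tb.attrib)
    remap = {}
    next_id = int(steps[0].get("s")) if steps else 0
    prev = None
    for step in steps:
        dst = (step.get("dstbuf"), step.get("dstoff"))
        if (prev is not None and step.get("type") == "re" and prev.get("type") == "re"
                and (prev.get("dstbuf"), prev.get("dstoff")) != dst):
            # Two back-to-back reductions into different destinations: separate them.
            ET.SubElement(new_tb, "step", {"s": str(next_id), **DUMMY})
            next_id += 1
        remap[int(step.get("s"))] = next_id
        ET.SubElement(new_tb, "step", {**step.attrib, "s": str(next_id)})
        next_id += 1
        prev = step
    return new_tb, remap


new_tb, remap = split_adjacent_reductions(ET.fromstring(FRAGMENT.strip()))
for step in new_tb:
    print(ET.tostring(step, encoding="unicode"))
print(remap)
```

On the two-step fragment this yields steps 4, 5, and 6 with the dummy send in the middle, and the map `{4: 4, 5: 6}`.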

I attached the original schedule xml where both of these issues manifest.

Any thoughts on the recommended way to do contiguous reductions (would be great to understand how msccl chains these reductions to get an intuition for things)?

Attachment

Here is the original schedule that FAILs the check

<algo name="reducescatter_RS_n8_c1_Simple" proto="Simple" nchannels="1" nchunksperloop="32" ngpus="8" coll="reduce_scatter" inplace="1">
  <gpu id="0" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="4" recv="4" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="4" dstbuf="s" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="8" dstbuf="s" dstoff="1" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="12" dstbuf="s" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="0" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="12" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="12" dstbuf="i" dstoff="16" cnt="1" depid="3" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="13" dstbuf="i" dstoff="17" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="14" dstbuf="i" dstoff="18" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="s" srcbuf="i" srcoff="16" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="11" type="nop" srcbuf="i" srcoff="-1" dstbuf="o" dstoff="-1" cnt="0" depid="1" deps="7" hasdep="0"/>
      <step s="12" type="nop" srcbuf="i" srcoff="-1" dstbuf="o" dstoff="-1" cnt="0" depid="3" deps="7" hasdep="0"/>      
      <step s="12" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="0" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="0" cnt="1" depid="1" deps="7" hasdep="0"/>      
      <step s="13" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="0" cnt="1" depid="3" deps="7" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
    <tb id="1" send="5" recv="5" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="5" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="28" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="20" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="20" cnt="1" depid="2" deps="1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="21" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="22" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="20" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="1" cnt="1" depid="0" deps="9" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="1" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="1" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="1" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="2" send="6" recv="6" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="9" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="20" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="24" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="24" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="2" cnt="1" depid="0" deps="9" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="2" cnt="1" depid="1" deps="7" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="2" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="3" send="7" recv="7" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="13" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="16" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="28" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="28" cnt="1" depid="1" deps="1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="29" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="30" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="28" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="3" cnt="1" depid="0" deps="9" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="3" cnt="1" depid="1" deps="7" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="3" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="3" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="1" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="4" recv="4" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="12" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="21" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="16" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="16" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="4" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="4" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="4" cnt="1" depid="1" deps="12" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="4" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="4" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="1" send="5" recv="5" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="0" dstbuf="s" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="8" dstbuf="s" dstoff="1" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="15" dstbuf="s" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="0" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="12" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="12" dstbuf="i" dstoff="20" cnt="1" depid="3" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="21" cnt="1" depid="0" deps="1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="13" dstbuf="i" dstoff="21" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="22" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="14" dstbuf="i" dstoff="22" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="23" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="s" srcbuf="i" srcoff="20" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="12" type="r" srcbuf="i" srcoff="4" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="13" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="5" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="14" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="5" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="15" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="5" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="16" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="5" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="2" send="6" recv="6" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="1" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="28" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="24" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="24" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="4" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="6" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="6" cnt="1" depid="1" deps="12" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="6" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="6" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="3" send="7" recv="7" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="9" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="20" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="28" cnt="1" depid="2" deps="1" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="29" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="30" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="31" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="28" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="4" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="7" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="7" cnt="1" depid="1" deps="12" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="7" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="7" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="2" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="4" recv="4" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="4" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="25" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="16" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="17" cnt="1" depid="1" deps="1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="18" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="19" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="16" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="8" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="8" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="8" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="8" cnt="1" depid="2" deps="12" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="8" cnt="1" depid="3" deps="4" hasdep="0"/>
    </tb>
    <tb id="1" send="5" recv="5" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="12" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="17" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="23" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="20" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="8" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="9" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="9" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="9" cnt="1" depid="2" deps="12" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="9" cnt="1" depid="3" deps="4" hasdep="0"/>
    </tb>
    <tb id="2" send="6" recv="6" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="0" dstbuf="s" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="7" dstbuf="s" dstoff="1" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="15" dstbuf="s" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="0" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="12" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="12" dstbuf="i" dstoff="24" cnt="1" depid="3" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="25" cnt="1" depid="0" deps="1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="13" dstbuf="i" dstoff="25" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="26" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="14" dstbuf="i" dstoff="26" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="27" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="s" srcbuf="i" srcoff="24" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="12" type="r" srcbuf="i" srcoff="8" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="13" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="10" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="14" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="10" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="15" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="10" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="16" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="10" cnt="1" depid="3" deps="4" hasdep="0"/>
    </tb>
    <tb id="3" send="7" recv="7" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="1" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="24" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="31" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="28" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="8" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="11" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="11" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="11" cnt="1" depid="2" deps="12" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="11" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="3" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="4" recv="4" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="29" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="17" cnt="1" depid="2" deps="1" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="18" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="19" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="12" dstbuf="i" dstoff="19" cnt="1" depid="3" deps="4" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="16" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="12" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="12" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="12" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="12" cnt="1" depid="2" deps="7" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="12" cnt="1" depid="3" deps="9" hasdep="0"/>
    </tb>
    <tb id="1" send="5" recv="5" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="4" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="25" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="13" dstbuf="i" dstoff="23" cnt="1" depid="3" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="20" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="12" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="13" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="13" cnt="1" depid="2" deps="7" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="13" cnt="1" depid="3" deps="9" hasdep="0"/>
    </tb>
    <tb id="2" send="6" recv="6" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="8" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="17" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="25" cnt="1" depid="1" deps="1" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="26" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="27" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="14" dstbuf="i" dstoff="27" cnt="1" depid="3" deps="4" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="24" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="12" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="14" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="14" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="14" cnt="1" depid="3" deps="9" hasdep="0"/>
    </tb>
    <tb id="3" send="7" recv="7" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="3" dstbuf="s" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="7" dstbuf="s" dstoff="1" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="11" dstbuf="s" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="0" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="12" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="29" cnt="1" depid="0" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="30" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="31" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="s" srcbuf="i" srcoff="28" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="r" srcbuf="i" srcoff="12" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="10" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="15" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="15" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="12" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="15" cnt="1" depid="2" deps="7" hasdep="0"/>
      <step s="13" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="15" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="4" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="0" recv="0" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="20" dstbuf="s" dstoff="12" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="24" dstbuf="s" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="28" dstbuf="s" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="12" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="0" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="0" cnt="1" depid="3" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="1" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="r" srcbuf="i" srcoff="16" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="10" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="16" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="16" cnt="1" depid="1" deps="7" hasdep="0"/>
      <step s="12" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="16" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="13" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="16" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="1" send="1" recv="1" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="21" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="12" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="4" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="4" cnt="1" depid="2" deps="1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="5" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="6" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="4" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="16" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="17" cnt="1" depid="0" deps="9" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="17" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="17" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="17" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="2" send="2" recv="2" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="25" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="4" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="1" dstbuf="i" dstoff="8" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="8" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="16" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="18" cnt="1" depid="0" deps="9" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="18" cnt="1" depid="1" deps="7" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="18" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="18" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="3" send="3" recv="3" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="29" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="0" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="2" dstbuf="i" dstoff="12" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="12" cnt="1" depid="1" deps="1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="12" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="16" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="19" cnt="1" depid="0" deps="9" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="19" cnt="1" depid="1" deps="7" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="19" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="19" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="5" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="0" recv="0" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="28" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="5" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="0" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="20" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="20" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="20" cnt="1" depid="1" deps="12" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="20" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="20" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="1" send="1" recv="1" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="16" dstbuf="s" dstoff="12" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="24" dstbuf="s" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="31" dstbuf="s" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="12" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="0" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="4" cnt="1" depid="3" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="5" cnt="1" depid="0" deps="1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="5" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="1" dstbuf="i" dstoff="6" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="6" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="2" dstbuf="i" dstoff="7" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="s" srcbuf="i" srcoff="4" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="12" type="r" srcbuf="i" srcoff="20" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="13" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="21" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="14" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="21" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="15" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="21" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="16" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="21" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="2" send="2" recv="2" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="17" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="12" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="8" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="8" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="20" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="22" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="22" cnt="1" depid="1" deps="12" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="22" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="22" cnt="1" depid="3" deps="7" hasdep="0"/>
    </tb>
    <tb id="3" send="3" recv="3" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="25" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="4" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="12" cnt="1" depid="2" deps="1" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="15" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="12" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="20" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="23" cnt="1" depid="0" deps="4" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="23" cnt="1" depid="1" deps="12" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="23" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="23" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="6" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="0" recv="0" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="20" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="9" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="0" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="1" cnt="1" depid="1" deps="1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="3" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="24" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="24" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="24" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="24" cnt="1" depid="2" deps="12" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="24" cnt="1" depid="3" deps="4" hasdep="0"/>
    </tb>
    <tb id="1" send="1" recv="1" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="28" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="1" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="7" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="4" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="24" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="25" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="25" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="25" cnt="1" depid="2" deps="12" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="25" cnt="1" depid="3" deps="4" hasdep="0"/>
    </tb>
    <tb id="2" send="2" recv="2" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="16" dstbuf="s" dstoff="12" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="23" dstbuf="s" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="31" dstbuf="s" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="12" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="0" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="8" cnt="1" depid="3" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="9" cnt="1" depid="0" deps="1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="9" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="1" dstbuf="i" dstoff="10" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="10" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="2" dstbuf="i" dstoff="11" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="s" srcbuf="i" srcoff="8" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="12" type="r" srcbuf="i" srcoff="24" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="13" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="26" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="14" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="26" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="15" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="26" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="16" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="26" cnt="1" depid="3" deps="4" hasdep="0"/>
    </tb>
    <tb id="3" send="3" recv="3" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="17" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="8" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="15" cnt="1" depid="2" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="12" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="24" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="27" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="27" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="27" cnt="1" depid="2" deps="12" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="27" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
  <gpu id="7" i_chunks="32" o_chunks="0" s_chunks="32">
    <tb id="0" send="0" recv="0" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="16" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="13" dstbuf="s" dstoff="0" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="6" dstbuf="i" dstoff="1" cnt="1" depid="2" deps="1" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="7" dstbuf="i" dstoff="2" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="8" dstbuf="i" dstoff="3" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="9" dstbuf="i" dstoff="3" cnt="1" depid="3" deps="4" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="0" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="28" dstbuf="s" dstoff="15" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="15" dstbuf="i" dstoff="28" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="19" dstbuf="i" dstoff="28" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="23" dstbuf="i" dstoff="28" cnt="1" depid="2" deps="7" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="27" dstbuf="i" dstoff="28" cnt="1" depid="3" deps="9" hasdep="0"/>
    </tb>
    <tb id="1" send="1" recv="1" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="20" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="9" dstbuf="s" dstoff="3" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="10" dstbuf="i" dstoff="7" cnt="1" depid="3" deps="4" hasdep="0"/>
      <step s="3" type="s" srcbuf="i" srcoff="4" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="i" srcoff="28" dstbuf="s" dstoff="19" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="5" type="re" srcbuf="s" srcoff="16" dstbuf="i" dstoff="29" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="20" dstbuf="i" dstoff="29" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="24" dstbuf="i" dstoff="29" cnt="1" depid="2" deps="7" hasdep="0"/>
      <step s="8" type="re" srcbuf="s" srcoff="28" dstbuf="i" dstoff="29" cnt="1" depid="3" deps="9" hasdep="0"/>
    </tb>
    <tb id="2" send="2" recv="2" chan="0">
      <!--comm step 1-->
      <step s="0" type="s" srcbuf="i" srcoff="24" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="r" srcbuf="i" srcoff="1" dstbuf="s" dstoff="6" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="2" type="re" srcbuf="s" srcoff="3" dstbuf="i" dstoff="9" cnt="1" depid="1" deps="1" hasdep="0"/>
      <step s="3" type="re" srcbuf="s" srcoff="4" dstbuf="i" dstoff="10" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="re" srcbuf="s" srcoff="5" dstbuf="i" dstoff="11" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="5" type="re" srcbuf="s" srcoff="11" dstbuf="i" dstoff="11" cnt="1" depid="3" deps="4" hasdep="0"/>
      <step s="6" type="s" srcbuf="i" srcoff="8" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="r" srcbuf="i" srcoff="28" dstbuf="s" dstoff="23" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="8" type="re" srcbuf="s" srcoff="17" dstbuf="i" dstoff="30" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="9" type="re" srcbuf="s" srcoff="21" dstbuf="i" dstoff="30" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="10" type="re" srcbuf="s" srcoff="25" dstbuf="i" dstoff="30" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="29" dstbuf="i" dstoff="30" cnt="1" depid="3" deps="9" hasdep="0"/>
    </tb>
    <tb id="3" send="3" recv="3" chan="0">
      <!--comm step 1-->
      <step s="0" type="cpy" srcbuf="i" srcoff="19" dstbuf="s" dstoff="12" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="1" type="cpy" srcbuf="i" srcoff="23" dstbuf="s" dstoff="13" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="2" type="cpy" srcbuf="i" srcoff="27" dstbuf="s" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="3" type="s" srcbuf="s" srcoff="12" dstbuf="s" dstoff="12" cnt="3" depid="-1" deps="-1" hasdep="0"/>
      <step s="4" type="r" srcbuf="s" srcoff="0" dstbuf="s" dstoff="9" cnt="3" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 2-->
      <step s="5" type="re" srcbuf="s" srcoff="0" dstbuf="i" dstoff="13" cnt="1" depid="0" deps="1" hasdep="0"/>
      <step s="6" type="re" srcbuf="s" srcoff="1" dstbuf="i" dstoff="14" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="7" type="re" srcbuf="s" srcoff="2" dstbuf="i" dstoff="15" cnt="1" depid="-1" deps="-1" hasdep="0"/>
      <step s="8" type="s" srcbuf="i" srcoff="12" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="0"/>
      <step s="9" type="r" srcbuf="i" srcoff="28" dstbuf="s" dstoff="27" cnt="4" depid="-1" deps="-1" hasdep="1"/>
      <!--comm step 3-->
      <step s="10" type="re" srcbuf="s" srcoff="18" dstbuf="i" dstoff="31" cnt="1" depid="0" deps="7" hasdep="0"/>
      <step s="11" type="re" srcbuf="s" srcoff="22" dstbuf="i" dstoff="31" cnt="1" depid="1" deps="4" hasdep="0"/>
      <step s="12" type="re" srcbuf="s" srcoff="26" dstbuf="i" dstoff="31" cnt="1" depid="2" deps="7" hasdep="0"/>
      <step s="13" type="re" srcbuf="s" srcoff="30" dstbuf="i" dstoff="31" cnt="1" depid="-1" deps="-1" hasdep="0"/>
    </tb>
  </gpu>
</algo>

Custom Ring All-Reduce between GPU 0 and GPU 7 does not compile

Hi, I have a DGX-1-like system and I want to write a custom collective algorithm that performs All-Reduce between 2 GPUs (GPU 0 and GPU 7) using the NVLinks. My collective is as follows:

`
import argparse

from msccl.language import *
from msccl.topologies import *
from msccl.language.collectives import AllReduce


def allreduce_ring(size, instances, channels, protocol):
    topology = fully_connected(size)
    collective = AllReduce(size, size, True)
    ring_order = [0, 3, 2, 1, 5, 6, 4, 7]
    with MSCCLProgram(f"allreduce_ring_{channels}channelsperring", topology, collective, instances,
                      protocol=protocol, threadblock_policy=ThreadblockPolicy.manual):

        # Reduce ring
        for step in range(0, size-1):
            for index in range(0, size):
                rank = 0
                next_rank = 0
                if index < 4:
                    rank = (0 + step) % size
                    next_rank = (0 + step + 1) % size
                else:
                    rank = (7 - step) % size
                    next_rank = (7 - step - 1) % size
                rank = ring_order[rank]
                next_rank = ring_order[next_rank]
                channel = index % channels
                if step == size-2:
                    c = chunk(next_rank, Buffer.input, index)
                    c.reduce(chunk(rank, Buffer.input, index), ch=channel, recvtb=channel, sendtb=channel)
                else:
                    c = chunk(rank, Buffer.input, index)
                    c = c.copy(next_rank, Buffer.input, index, ch=channel, recvtb=channel, sendtb=channel)

        # Propagate ring
        for step in range(0, size-1):
            for index in range(0, size):
                rank = 0
                next_rank = 0
                if index >= 4:
                    rank = (0 + step) % size
                    next_rank = (0 + step + 1) % size
                else:
                    rank = (7 - step) % size
                    next_rank = (7 - step - 1) % size
                rank = ring_order[rank]
                next_rank = ring_order[next_rank]
                channel = index % channels
                c = chunk(rank, Buffer.input, index)
                c = c.copy(next_rank, Buffer.input, index, ch=channel, recvtb=channel, sendtb=channel)

        XML()


parser = argparse.ArgumentParser()
parser.add_argument('--num_gpus', type=int, help='number of gpus')
parser.add_argument('--channels', type=int, help='Number of channels to use for 1 instance of the ring [1-8]')
parser.add_argument('--instances', type=int, help='number of instances')
parser.add_argument('--protocol', type=str, default='LL128', choices=['Simple', 'LL', 'LL128'], help='NCCL protocol. Default: LL128')
args = parser.parse_args()

allreduce_ring(args.num_gpus, args.instances, args.channels, args.protocol)
`

However, when I try to compile it and generate the XML, I get the following error:

Traceback (most recent call last):
  File "/home/rashidi/msccl-tools/examples/mscclang/allreduce_a100_ring_0_7.py", line 67, in <module>
    allreduce_ring(args.num_gpus, args.instances, args.channels, args.protocol)
  File "/home/rashidi/msccl-tools/examples/mscclang/allreduce_a100_ring_0_7.py", line 57, in allreduce_ring
    XML()
  File "/home/rashidi/.local/lib/python3.10/site-packages/msccl/language/__init__.py", line 153, in XML
    print(_curr().generate_xml())
  File "/home/rashidi/.local/lib/python3.10/site-packages/msccl/language/__init__.py", line 129, in generate_xml
    return ir_to_xml(self.lower(), dependence_nop=self.dependence_nop)
  File "/home/rashidi/.local/lib/python3.10/site-packages/msccl/language/__init__.py", line 116, in lower
    manual_assign_tbs(self.instr_dag)
  File "/home/rashidi/.local/lib/python3.10/site-packages/msccl/language/tb_assignment.py", line 39, in manual_assign_tbs
    raise Exception(f"Illegal threadblock assignment. Trying to add {op} to threadblock {tbid}\n" \
Exception: Illegal threadblock assignment. Trying to add Op(rcs, 5, Ref(Buffer:i, Index:0, Size:1, Rank:1), Ref(Buffer:i, Index:0, Size:1, Rank:6), step:-1, tb:0) to threadblock 0
Threadblock 0 send:1 recv:6 channel:0
Operation send:6 recv:6 channel:0

I would really appreciate any solution to this!
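
The last three lines of the error suggest that threadblock 0 is already bound to send peer 1 and receive peer 6 on channel 0, while the new operation needs send peer 6. The sketch below is a simplified model of that kind of compatibility check, not the actual msccl-tools implementation; the names `Threadblock` and `try_assign` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Threadblock:
    # -1 means "not bound yet" (an assumption of this model)
    send: int = -1
    recv: int = -1
    channel: int = -1


def try_assign(tb, send, recv, channel):
    """Bind an operation's peers/channel to tb if compatible; return success."""
    wanted = {"send": send, "recv": recv, "channel": channel}
    # First pass: reject if any already-bound field disagrees with the operation.
    for field, value in wanted.items():
        current = getattr(tb, field)
        if value != -1 and current != -1 and current != value:
            return False
    # Second pass: bind the fields that the operation specifies.
    for field, value in wanted.items():
        if value != -1:
            setattr(tb, field, value)
    return True


tb0 = Threadblock()
first_ok = try_assign(tb0, send=1, recv=6, channel=0)   # binds send:1 recv:6
second_ok = try_assign(tb0, send=6, recv=6, channel=0)  # conflicts on send peer
print(first_ok, second_ok)  # True False
```

In this model, reusing `sendtb=channel` and `recvtb=channel` for every step makes operations with different peers land on the same threadblock index, which is the kind of conflict the message reports.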

What is the difference between microsoft/msccl and Azure/msccl-executor-nccl?

What is the difference between this code repo and Azure's msccl-executor-nccl (https://github.com/Azure/msccl-executor-nccl)?
Both can be utilized for executing customized MSCCL algorithms, yet it appears that the NCCL baseline version employed by Azure's msccl-executor-nccl is 2.18, more recent than that used in this repo.

All-to-all correctness check does not seem to work

The all-to-all correctness check seems to pass on any schedule, even ones that are intentionally modified to be wrong. It is possible that the schedule was ignored, but I set -x NCCL_DEBUG=WARN and saw nothing.

The command I ran is as follows:
mpirun -np 8 -x NCCL_NET_SHARED_BUFFERS=0 -x MSCCL_XML_FILES=schedule/allpairs_alltoall.xml -x NCCL_ALGO=MSCCL -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 -x NCCL_DEBUG=WARN -x NCCL_DEBUG_SUBSYS=ALL -x LD_LIBRARY_PATH=~/msccl/build/lib/:$LD_LIBRARY_PATH ~/nccl-tests/build/alltoall_perf -b 32 -e 16777216 -f 2 -g 1 -c 1 -n 200 -w 200 -z 0

I am using the latest msccl and nccl-tests.
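
For reference, the property such a check should enforce is that the chunk rank r receives from rank j equals the chunk rank j addressed to rank r. The sketch below is a pure-Python model of that property, not the checker used by nccl-tests; `alltoall_reference` and `count_wrong` are hypothetical helper names.

```python
def alltoall_reference(inputs):
    """inputs[src][dst] is the chunk src sends to dst.

    Returns outputs where outputs[dst][src] is the chunk dst receives from src.
    """
    n = len(inputs)
    return [[inputs[src][dst] for src in range(n)] for dst in range(n)]


def count_wrong(outputs, inputs):
    """Count chunks in outputs that differ from the reference all-to-all result."""
    expected = alltoall_reference(inputs)
    return sum(
        got != want
        for out_row, exp_row in zip(outputs, expected)
        for got, want in zip(out_row, exp_row)
    )


n = 4
# Tag every chunk with (sender, receiver) so misplaced chunks are detectable.
inputs = [[(src, dst) for dst in range(n)] for src in range(n)]

good = alltoall_reference(inputs)
bad = [row[:] for row in good]
bad[0][1], bad[0][2] = bad[0][2], bad[0][1]  # swap two chunks on rank 0

print(count_wrong(good, inputs))  # 0
print(count_wrong(bad, inputs))   # 2
```

A deliberately broken schedule should produce a non-zero count under a check of this shape; if it does not, the schedule was probably not the one executed.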

TACCL code publishing plan

Hi, I've read your TACCL paper and code in this repo. I'm wondering if TACCL code is in this repo yet (I think not). Do you have a timetable or plan of publishing it? Thanks!

working example

Hi, is there any example of an algorithm spec that can be lowered and run with sccl-rt or is there a plan for this to be the case eventually?

I tried exporting an SCCL algorithm, but the two repos don't seem to be in sync on format, etc.

NCCL_ALGO=SCCL SCCL_XML_FILES=/development/sccl-rt/allgather-n4-ring-oldformat.xml NCCL_DEBUG=INFO  ./build/all_gather_perf -t 4 -g 1 -b 16MB -e 16MB -f 2

Even after getting it to parse with old_format, the allgather collective didn't recognize the algorithm.

nvlink errors when building msccl

I got lots of nvlink errors when building msccl:

Could you please help me with that?

MSCCL all-to-all performance did not improve compared with NCCL

Hi, I ran the nccl-tests alltoall_perf benchmark on 1/2/8 nodes with 8xA100 GPUs and found that the performance of MSCCL (in-place) did not improve compared with NCCL (out-of-place). My MSCCL_XML_FILES were generated by python msccl-tools/examples/mscclang/alltoall_a100_two_step.py --protocol=LL 8 8 > two_step_64.xml. I also tried alltoall_a100_three_step.py and alltoall_allpairs.py; they all behaved similarly.
The test code is nccl-tests/build/alltoall_perf -b 1MB -e 1024MB -f 2 -g 1 -n 100 -w 100, and I used 8/16/64 GPUs to run it, corresponding to 1/2/8 nodes.
The alltoall-test result of 8 nodes is like this:

#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
     1048576          4096     float    none      -1   9012.2    0.12    0.11      0    561.5    1.87    1.84    N/A
     2097152          8192     float    none      -1   1067.7    1.96    1.93      0   1046.3    2.00    1.97    N/A
     4194304         16384     float    none      -1   2010.8    2.09    2.05      0   2023.0    2.07    2.04    N/A
     8388608         32768     float    none      -1   5698.5    1.47    1.45      0   4261.4    1.97    1.94    N/A
    16777216         65536     float    none      -1   8339.5    2.01    1.98      0   8211.3    2.04    2.01    N/A
    33554432        131072     float    none      -1    16235    2.07    2.03      0    16281    2.06    2.03    N/A
    67108864        262144     float    none      -1    32252    2.08    2.05      0    51440    1.30    1.28    N/A
   134217728        524288     float    none      -1    63877    2.10    2.07      0    83221    1.61    1.59    N/A
   268435456       1048576     float    none      -1   147334    1.82    1.79      0   142747    1.88    1.85    N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 1.77934

I also found that the average bus bandwidth drops sharply on multiple nodes (2/8) compared with one node. I have attached the logs for 8/16/64 GPUs below. Thank you!
gpu8-two_step.log
gpu16-two_step.log
gpu64-two_step.log
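
When reading the table above, it helps to know how the bandwidth columns relate to the timing column. As nccl-tests documents it, algbw is size divided by time, and for all-to-all busbw is algbw scaled by (n-1)/n for n ranks. The sketch below reproduces the in-place values of the first row, assuming n = 64 for the 8-node run; the function names are illustrative.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s: bytes moved divided by elapsed time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def alltoall_busbw_gbps(size_bytes, time_us, nranks):
    """Bus bandwidth for all-to-all: algbw scaled by (n - 1) / n."""
    return algbw_gbps(size_bytes, time_us) * (nranks - 1) / nranks


# First row of the 8-node table, in-place columns; 64 ranks is an assumption
# based on the 8 nodes x 8 GPUs described in the post.
size_bytes, time_us, nranks = 1048576, 561.5, 64

print(f"algbw = {algbw_gbps(size_bytes, time_us):.2f} GB/s")
# algbw = 1.87 GB/s
print(f"busbw = {alltoall_busbw_gbps(size_bytes, time_us, nranks):.2f} GB/s")
# busbw = 1.84 GB/s
```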

Compilation failure

The build failed because uint64_t was undefined.
Adding #include <cstdint> to src/include/collectives.h makes it work.
There may be a better way to fix this.

../../include/collectives.h(18): error: identifier "uint64_t" is undefined
    uint64_t scalarArg;
    ^
1 error detected in the compilation of "custom_collective.cu".
make[2]: *** [/home/ubuntu/jihao/msccl/build/obj/collectives/device/Makefile.rules:2274: /home/ubuntu/jihao/msccl/build/obj/collectives/device/custom_collective_max_i64.o] Error 2

nccl baseline

Which NCCL version does MSCCL use as its baseline?

How to set the target buffer size range for MSCCL algorithms

Hi,
When I write MSCCL All-Reduce algorithms and run nccl-tests, it seems that for all-reduces larger than 64 MB the MSCCL runtime falls back to the NCCL algorithm rather than executing my MSCCL algorithm. How can we set the buffer size range for our MSCCL algorithms?

Thanks

nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum

I followed the instructions in the README, but when it came time to run nccl-tests I got the following error on an NDv4 node in Azure:

Command

mpirun -np 8 -mca pml ob1 --mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 -mca  coll_hcoll_enable 0  -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH \
  -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x NCCL_NET_SHARED_BUFFERS=0 -x SCCL_XML_FILES=msccl/src/xml_generator/ar_ll.xml:msccl/src/xml_generator/ar_ll128.xml -x NCCL_ALGO=SCCL,RING,TREE -x CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7  \
    nccl-tests/build/all_reduce_perf -b 128 -e 64MB -f 2 -g 1 -c 1 -n 1000 -w 1000 -z 0

Error

nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
nccl-tests/build/all_reduce_perf: symbol lookup error: nccl-tests/build/all_reduce_perf: undefined symbol: ncclRedOpCreatePreMulSum
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[10610,1],0]
  Exit code:    127
--------------------------------------------------------------------------

ReadMe Instructions

$ git clone -b lowlatency https://github.com/microsoft/msccl.git
$ cd msccl/
$ make -j src.build
$ cd ../
$ git clone https://github.com/nvidia/nccl-tests.git
$ cd nccl-tests/
$ make MPI=1 NCCL_HOME=../msccl/build/ -j 
$ cd ../

`nvcc -V`

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0