Comments (54)
You need to look at the NCCL log (NCCL_DEBUG=INFO) to see which one it uses. My guess is that it will only use the fastest one, since I think (could be wrong here) your PCIe bandwidth can draw at most 200Gbps.
It should use the fastest one (0), because we can find some info in the previous logs:
SH-IDC1-10-142-5-97:73905:73905 [2] NCCL INFO Using network IB
SH-IDC1-10-142-5-97:73910:73910 [7] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE ; OOB ib0:10.142.205.97<0>
SH-IDC1-10-142-6-127:49330:49330 [6] NCCL INFO Using network IB
SH-IDC1-10-142-6-127:49324:49324 [4] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_2:1/RoCE ; OOB ib0:10.142.206.127<0>
...
SH-IDC1-10-142-5-203:76752:77230 [7] NCCL INFO Channel 01 : 23[dc000] -> 24[41000] [send] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76745:77233 [0] NCCL INFO Channel 01 : 15[dc000] -> 16[51000] [receive] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76746:77234 [1] NCCL INFO Channel 01 : 17[56000] -> 18[6b000] via P2P/IPC/read
SH-IDC1-10-142-5-203:76751:77235 [6] NCCL INFO Channel 00 : 22[d9000] -> 24[41000] [send] via NET/IB/0/Shared
SH-IDC1-10-142-5-203:76746:77234 [1] NCCL INFO Channel 00 : 15[dc000] -> 17[56000] [receive] via NET/IB/0/Shared
...
from msccl.
What input sizes do you need the alltoall for?
256MiB.
About the relay logic, do we need to find the proper local GPU? I don't know how to find the GPU which is nearest to the IB; the BytePS paper says this will be faster?
See if this helps: https://github.com/microsoft/inspector-topo We have a similar type of node on Azure and we use inspector-topo to find the IB placement. However, it might not matter much given that you have a newer machine.
Another good option is nvidia-smi topo -m, which might give you more information. If you can share that log in here, that would be great.
Here is the result of
nvidia-smi topo -m
in our machine:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB PXB 0-27,56-83 0
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB PXB 0-27,56-83 0
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 NODE NODE NODE 0-27,56-83 0
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 NODE NODE NODE 0-27,56-83 0
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS 28-55,84-111 1
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS 28-55,84-111 1
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS 28-55,84-111 1
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS 28-55,84-111 1
mlx5_0 PXB PXB NODE NODE SYS SYS SYS SYS X PIX PIX
mlx5_1 PXB PXB NODE NODE SYS SYS SYS SYS PIX X PIX
mlx5_2 PXB PXB NODE NODE SYS SYS SYS SYS PIX PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.
from msccl.
There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.
Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
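To make the relay logic concrete, here is a minimal sketch in plain Python (illustrative bookkeeping only, not the msccl-tools DSL; the rank numbering and helper names are assumptions):

```python
GPUS_PER_NODE = 8

def three_step_schedule(num_nodes):
    """Return the three phases as lists of (src_rank, dst_rank) moves,
    with local GPU 0 acting as the single relay on every node."""
    rank = lambda node, gpu: node * GPUS_PER_NODE + gpu
    gather, cross, scatter = [], [], []
    for src_node in range(num_nodes):
        for dst_node in range(num_nodes):
            if src_node == dst_node:
                continue  # intra-node chunks move directly over NVLink
            # Step 1: every local GPU hands its chunks for dst_node to GPU 0.
            for gpu in range(1, GPUS_PER_NODE):
                gather.append((rank(src_node, gpu), rank(src_node, 0)))
            # Step 2: one aggregated IB transfer between the two relay GPUs.
            cross.append((rank(src_node, 0), rank(dst_node, 0)))
            # Step 3: the receiving relay scatters chunks to their final GPUs.
            for gpu in range(1, GPUS_PER_NODE):
                scatter.append((rank(dst_node, 0), rank(dst_node, gpu)))
    return gather, cross, scatter
```

For 8 nodes this leaves only num_nodes*(num_nodes-1) = 56 aggregated cross-node transfers in step 2, at the cost of funneling all IB traffic through GPU 0 on each side.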
Cool! We can simply modify CrossNodeGpus to just use local rank 0 to implement the algo you mentioned?
Exactly! BTW, I am not sure if it will work better for you; I am just judging from my intuition.
By the way, what's the difference between alltoall two step and three step? Two step: (1) gather cross-node traffic to a local GPU and (2) then the local GPU sends cross-node data to its local GPU peers on other nodes. So in two step we don't group chunks into a larger one?
So, in the two-step algorithm, all of the cross-node traffic, for example, to local GPU 6 on node 0 goes via local GPU 6 on other nodes. So, local GPU i on node j aggregates the traffic from other local GPUs on node j that needs to go to local GPU i on other nodes.
In the three-step algorithm, the aggregation is more aggressive, and it was only beneficial in our experiments at 64-node scale and larger. In this algorithm, node i and node j communicate via only one local GPU on each (the CrossNodeGpus logic). In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try!
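To spell the difference out, a small sketch (plain Python with my own names, not msccl code) of the two relay rules and the cross-node connection counts they imply:

```python
def two_step_relay(src_node, src_gpu, dst_node, dst_gpu):
    # Chunks bound for local GPU dst_gpu on any node are staged on the GPU
    # with the same local index on the source node.
    return (src_node, dst_gpu)

def three_step_relay(src_node, src_gpu, dst_node, dst_gpu, relay_gpu=0):
    # All cross-node traffic funnels through one relay GPU per node
    # (the CrossNodeGpus role; relay_gpu=0 matches the single-IB case here).
    return (src_node, relay_gpu)

def cross_node_connections(num_nodes, gpus_per_node, relay_fn):
    """Count distinct (relay GPU, destination node) IB connections."""
    conns = set()
    for sn in range(num_nodes):
        for sg in range(gpus_per_node):
            for dn in range(num_nodes):
                for dg in range(gpus_per_node):
                    if dn != sn:
                        conns.add((relay_fn(sn, sg, dn, dg), dn))
    return len(conns)

print(cross_node_connections(64, 8, two_step_relay))    # 32256
print(cross_node_connections(64, 8, three_step_relay))  # 4032
```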
Also, try NPKit profiler for further analysis of your algorithm!
from msccl.
From our experience, there are still differences at 64 GPUs and up, but PXN does a pretty good job as well. After 1024 GPUs, though, you will need to switch to the 3-step algorithm. I think PXN is disabled after a certain number of GPUs.
If I understand correctly, in the 3-step algorithm, only one GPU's NIC will be used to exchange cross-node traffic. If one node has 8 GPUs and 8 NICs, the available bandwidth will be decreased to 1/8. Could you share more of your experience on why the 3-step algorithm improves performance after 1024 GPUs?
Not exactly. Imagine you have 9 nodes in total. On node-0, NIC-0 will be talking to node-1, NIC-1 will be talking to node-2, ..., and NIC-7 will be talking to node-8. So, you can imagine the first step is a local gather operation, the second step is cross-node communication, and the last step is a local scatter operation. This reduces the number of cross-node connections by another 8x over the 2-step algorithm.
In general, for a perfect load-balance, we need 8K+1 nodes for 3-step algorithm. However, at scale (1024 GPUs for example), the load imbalance is not too bad.
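A sketch of that load-balanced relay assignment (my reading of the 9-node example above; the modular formula is an assumption, not msccl code):

```python
def relay_nic(src_node, dst_node, num_nics=8):
    """NIC on src_node used to reach dst_node: spreads node pairs evenly
    across NICs, which is perfectly balanced when the node count is 8K+1."""
    assert src_node != dst_node
    return (dst_node - src_node - 1) % num_nics

# Node 0 on 9 nodes: NIC 0 -> node 1, NIC 1 -> node 2, ..., NIC 7 -> node 8.
print([relay_nic(0, k) for k in range(1, 9)])  # [0, 1, 2, 3, 4, 5, 6, 7]
```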
from msccl.
Hi @Musisoul, I think you used the original nccl-tests which doesn't have ncclAllToAll
and therefore, all of your logs correspond to NCCL's alltoall algorithm. Please change the nccl-tests alltoall.cu's way of calling alltoall to this way. Alternatively, you could just use this forked version of nccl-tests to test alltoall performance.
Regarding the performance drop: clearly for cross-node communication, some networking interface will be used. If you can provide your log with NCCL_DEBUG=INFO
environment variable, I would be able to help you better.
Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well.
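A toy illustration of why (Python lists standing in for GPU buffers; a hypothetical two-rank exchange, not nccl-tests code):

```python
# bufs[rank][peer] holds the chunk that `rank` owes to `peer`.
bufs = [["a0", "a1"], ["b0", "b1"]]
bufs[0][1] = bufs[1][0]  # rank 0 receives from rank 1 in place...
bufs[1][0] = bufs[0][1]  # ...so rank 1 now reads the clobbered slot
print(bufs)  # rank 1 ends up with "b0" instead of the original "a1"
```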
I hope this helps!
from msccl.
Hi @Musisoul, I think you used the original nccl-tests which doesn't have
ncclAllToAll
and therefore, all of your logs correspond to NCCL's alltoall algorithm. Please change the nccl-tests alltoall.cu's way of calling alltoall to this way. Alternatively, you could just use this forked version of nccl-tests to test alltoall performance.
Regarding the performance drop: clearly for cross-node communication, some networking interface will be used. If you can provide your log with
NCCL_DEBUG=INFO
environment variable, I would be able to help you better.
Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well.
I hope this helps!
Thanks for your reply! I'll change the version of nccl-tests and try again.
As for the performance drop, I have attached the logs using NCCL_DEBUG=INFO
below.
gpu8-two_step_complete.log
gpu64-two_step_complete.log
from msccl.
Thanks for the kind reply. We tested according to the msccl readme. The changes in alltoall.cu mean msccl generated a new API, ncclAllToAll, which is currently not compatible with the nccl API? nccl currently doesn't have this API.
from msccl.
I found msccl2DAllToAll in msccl; it seems there are two kinds of all2all in msccl?
- native implemented msccl2DAllToAll
- algorithms generated using msccl-tools
Three questions:
- what's the correct way to let msccl use the native msccl2DAllToAll?
- is NCCL_ALGO used to specify the priority order of available algos?
- when generating allreduce algos in msccl-tools, we should specify the number of nodes; does this mean whenever I change the training scale, I should first generate the corresponding msccl XMLs?
from msccl.
Hi @Musisoul, I think you used the original nccl-tests which doesn't have
ncclAllToAll
and therefore, all of your logs correspond to NCCL's alltoall algorithm. Please change the nccl-tests alltoall.cu's way of calling alltoall to this way. Alternatively, you could just use this forked version of nccl-tests to test alltoall performance.
Regarding the performance drop: clearly for cross-node communication, some networking interface will be used. If you can provide your log with
NCCL_DEBUG=INFO
environment variable, I would be able to help you better.
Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well.
I hope this helps!
Hi, I have tried the forked version of nccl-tests; the result is different from before, but the in-place result is extremely high compared with out-of-place. That seems abnormal. Here is the 2-node (16 GPUs) all-to-all test result:
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 16384 float 571.5 1.83 1.72 0e+00 0.50 2087.88 1957.39 1e+00
2097152 32768 float 1043.0 2.01 1.88 0e+00 0.21 9985.96 9361.84 1e+00
4194304 65536 float 1972.5 2.13 1.99 0e+00 0.22 18941.04 17757.23 1e+00
8388608 131072 float 3909.3 2.15 2.01 0e+00 0.51 16433.75 15406.64 1e+00
16777216 262144 float 7596.3 2.21 2.07 0e+00 0.45 37659.30 35305.59 1e+00
33554432 524288 float 14892 2.25 2.11 0e+00 0.17 196339.57 184068.34 1e+00
67108864 1048576 float 29605 2.27 2.13 0e+00 0.21 320099.52 300093.30 1e+00
134217728 2097152 float 33285 4.03 3.78 0e+00 0.49 271437.56 254472.71 1e+00
268435456 4194304 float 76900 3.49 3.27 0e+00 0.14 1938582.05 1817420.67 1e+00
536870912 8388608 float 181940 2.95 2.77 0e+00 0.40 1341070.90 1257253.97 1e+00
1073741824 16777216 float 285011 3.77 3.53 0e+00 0.47 2292116.18 2148858.92 1e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 274636
The complete logs are attached below.
n8-two_step-complete.log
n16-two_step-complete.log
What's the correct usage of the all-to-all tests? Does it differ from the allreduce tests, like generating XMLs in msccl-tools and the startup code (e.g. NCCL_ALGO=MSCCL,RING,TREE) in nccl-tests?
Thank you!
from msccl.
Hi @Musisoul, I think you used the original nccl-tests which doesn't have
ncclAllToAll
and therefore, all of your logs correspond to NCCL's alltoall algorithm. Please change the nccl-tests alltoall.cu's way of calling alltoall to this way. Alternatively, you could just use this forked version of nccl-tests to test alltoall performance.
Regarding the performance drop: clearly for cross-node communication, some networking interface will be used. If you can provide your log with NCCL_DEBUG=INFO
environment variable, I would be able to help you better.
Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well.
I hope this helps!
Thanks for your reply! I'll change the version of nccl-tests and try again. As for the performance drop, I have attached the logs using
NCCL_DEBUG=INFO
below. gpu8-two_step_complete.log gpu64-two_step_complete.log
This is an interesting topology you have! 8xA100s with only a single IB! I think your IB has 200Gbps BW, which explains your numbers. I think one can design a much better algorithm for your specific topology, but the one you are currently using seems to be delivering pretty good performance. If you really want the maximum performance, we can help you design the algorithm for your topology.
from msccl.
Thanks for the kind reply. We tested according to the msccl readme. The changes in alltoall.cu mean msccl generated a new API, ncclAllToAll, which is currently not compatible with the nccl API? nccl currently doesn't have this API.
That's correct. NCCL made the decision not to make alltoall a collective with an API, but RCCL (AMD's version of NCCL) and every MPI implementation have alltoall as an API. Because of this, we apply a patch to PyTorch to replace alltoall's implementation with a call to our API. Recent versions of PyTorch have this API for RCCL.
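For reference, this is the standard torch.distributed call that such a patch reroutes to the backend's alltoall (a minimal sketch; whether it reaches ncclAllToAll depends on the patched build):

```python
import torch
import torch.distributed as dist

dist.init_process_group("nccl")  # one process per GPU, e.g. via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
world = dist.get_world_size()
send = torch.arange(world * 4, dtype=torch.float32, device="cuda")
recv = torch.empty_like(send)
dist.all_to_all_single(recv, send)  # equal-sized chunk to every rank
```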
from msccl.
Hi @Musisoul, I think you used the original nccl-tests which doesn't have
ncclAllToAll
and therefore, all of your logs correspond to NCCL's alltoall algorithm. Please change the nccl-tests alltoall.cu's way of calling alltoall to this way. Alternatively, you could just use this forked version of nccl-tests to test alltoall performance.
Regarding the performance drop: clearly for cross-node communication, some networking interface will be used. If you can provide your log with NCCL_DEBUG=INFO
environment variable, I would be able to help you better.
Lastly, please note that in-place alltoall is impossible to perform correctly. As you can see in your logs, the #wrong column is N/A for in-place alltoall even for NCCL runs. The algorithms that you generated with msccl-tools are only valid for out-of-place as well.
I hope this helps!
Hi, I have tried the forked version of nccl-tests; the result is different from before, but the in-place result is extremely high compared with out-of-place. That seems abnormal. Here is the 2-node (16 GPUs) all-to-all test result:
# out-of-place in-place
# size count type redop time algbw busbw error time algbw busbw error
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 16384 float 571.5 1.83 1.72 0e+00 0.50 2087.88 1957.39 1e+00
2097152 32768 float 1043.0 2.01 1.88 0e+00 0.21 9985.96 9361.84 1e+00
4194304 65536 float 1972.5 2.13 1.99 0e+00 0.22 18941.04 17757.23 1e+00
8388608 131072 float 3909.3 2.15 2.01 0e+00 0.51 16433.75 15406.64 1e+00
16777216 262144 float 7596.3 2.21 2.07 0e+00 0.45 37659.30 35305.59 1e+00
33554432 524288 float 14892 2.25 2.11 0e+00 0.17 196339.57 184068.34 1e+00
67108864 1048576 float 29605 2.27 2.13 0e+00 0.21 320099.52 300093.30 1e+00
134217728 2097152 float 33285 4.03 3.78 0e+00 0.49 271437.56 254472.71 1e+00
268435456 4194304 float 76900 3.49 3.27 0e+00 0.14 1938582.05 1817420.67 1e+00
536870912 8388608 float 181940 2.95 2.77 0e+00 0.40 1341070.90 1257253.97 1e+00
1073741824 16777216 float 285011 3.77 3.53 0e+00 0.47 2292116.18 2148858.92 1e+00
# Out of bounds values : 0 OK
# Avg bus bandwidth : 274636
The complete logs are attached below. n8-two_step-complete.log n16-two_step-complete.log
What's the correct usage of the all-to-all tests? Does it differ from the allreduce tests, like generating XMLs in msccl-tools and the startup code (e.g. NCCL_ALGO=MSCCL,RING,TREE) in nccl-tests? Thank you!
These numbers make sense to me given your topology. Regarding in-place: note that the error column is non-zero, which means that the implementation is incorrect. The reason is that in the forked repo, we completely disabled in-place alltoall, as no one expects in-place alltoall to work correctly. That's why the in-place numbers are crazy high: no communication is performed and the call returns immediately.
from msccl.
I found msccl2DAllToAll in msccl; it seems there are two kinds of all2all in msccl?
- native implemented msccl2DAllToAll
- algorithms generated using msccl-tools
Three questions:
- what's the correct way to let msccl use the native msccl2DAllToAll?
- is NCCL_ALGO used to specify the priority order of available algos?
- when generating allreduce algos in msccl-tools, we should specify the number of nodes; does this mean whenever I change the training scale, I should first generate the corresponding msccl XMLs?
Great questions!
AllToAll: msccl2DAllToAll is triggered if the name field inside the XML is just 2D. However, it only performs better than the one you are generating with msccl-tools when there are >1K GPUs. This was a hacky way for us to compare the performance of the two approaches, and it will be removed in future releases. NCCL_ALGO doesn't need to change.
msccl.init from here is an automatic way to manage these XMLs. But given that you have a unique topology, you won't necessarily get the maximum bang for the buck with the same algorithms. The ones that we have studied the performance for were on 8xA100 nodes with 8 IBs, such as the NDv4 SKU offered on Azure.
I hope this helps.
from msccl.
If you really want the maximum performance, we can help you design the algorithm for your topology
Sure, we really want the maximum performance on 42 A100 nodes with only one 200Gbps IB per node. It would be great if you could help design the algorithm.
BTW, I've read the GC3 paper and found it inspiring; it's very useful and effective in this case.
from msccl.
What input sizes do you need the alltoall for? Your algBW at max should be 25GBps/(N-1). For example, for 8 nodes, the maximum you can get at very large sizes is 3.57GBps algBW.
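One way to arrive at that 3.57 figure (my reading of the formula, assuming one 25GBps NIC shared by 8 GPUs per node):

```python
def max_alltoall_algbw(num_nodes, nic_gbps=25.0, gpus_per_node=8):
    # Each GPU sends (num_nodes - 1) / num_nodes of its buffer over the IB,
    # and the NIC is shared by gpus_per_node GPUs.
    per_gpu_nic = nic_gbps / gpus_per_node
    return per_gpu_nic * num_nodes / (num_nodes - 1)

print(max_alltoall_algbw(8))  # ~3.57 GBps algBW at very large sizes
```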
Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
Hope this helps.
from msccl.
What input sizes do you need the alltoall for?
256MiB.
About the relay logic, do we need to find the proper local GPU? I don't know how to find the GPU which is nearest to the IB; the BytePS paper says this will be faster?
from msccl.
What input sizes do you need the alltoall for? Your algBW at max should be
25GBps/(N-1)
. For example, for 8 nodes, the maximum you can get at very large sizes is 3.57GBps algBW.
Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
Hope this helps.
As Jack47 says, the size we need is 256MiB. BTW, I have tried the nccl-tests (forked version) and compared the time and busbw between the NCCL and MSCCL algos, both out-of-place. It seems that the MSCCL algos did not improve things. I used 8 nodes to test and the size was 256MiB. What did you mean by
the one you are currently using seems to be delivering pretty good performance.
Thank you!
| | time(us) | busbw(GB/s) |
|---|---|---|
| NCCL (out of place) | 123058 | 2.15 |
| MSCCL two_step (out of place) | 165042 | 1.60 |
| MSCCL three_step (out of place) | 150965 | 1.78 |
from msccl.
What input sizes do you need the alltoall for?
256MiB.
About the relay logic, do we need to find the proper local GPU? I don't know how to find the GPU which is nearest to the IB; the BytePS paper says this will be faster?
See if this helps https://github.com/microsoft/inspector-topo
We have a similar type of node on Azure and we use inspector-topo to find the IB placement. However, it might not matter much given that you have a newer machine.
Another good option is nvidia-smi topo -m
which might give you more information. If you can share that log in here, that would be great.
from msccl.
What input sizes do you need the alltoall for? Your algBW at max should be
25GBps/(N-1)
. For example, for 8 nodes, the maximum you can get at very large sizes is 3.57GBps algBW.
Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
Hope this helps.
As Jack47 says, the size we need is 256MiB. BTW, I have tried the nccl-tests (forked version) and compared the time and busbw between the NCCL and MSCCL algos, both out-of-place. It seems that the MSCCL algos did not improve things. I used 8 nodes to test and the size was 256MiB. What did you mean by
the one you are currently using seems to be delivering pretty good performance.
Thank you!
| | time(us) | busbw(GB/s) |
|---|---|---|
| NCCL (out of place) | 123058 | 2.15 |
| MSCCL two_step (out of place) | 165042 | 1.60 |
| MSCCL three_step (out of place) | 150965 | 1.78 |
I am not surprised. Both the 3-step and 2-step algorithms are over-subscribing the single IB on each node. Those two were not designed for your topology. I think a modified 3-step algorithm where all cross-node traffic goes through a single GPU would work the best.
NCCL gives you 2.15, but theoretically (based purely on bandwidth) you should be able to get around 3.4; the reason for this gap is the suboptimal algorithm NCCL is using for your topology. Can you please share your 64xGPU results for a large range from 1KB to 4GB (or however large it can run)?
from msccl.
What input sizes do you need the alltoall for?
256MiB.
About the relay logic, do we need to find the proper local GPU? I don't know how to find the GPU which is nearest to the IB; the BytePS paper says this will be faster?
See if this helps: https://github.com/microsoft/inspector-topo We have a similar type of node on Azure and we use inspector-topo to find the IB placement. However, it might not matter much given that you have a newer machine.
Another good option is
nvidia-smi topo -m
which might give you more information. If you can share that log in here, that would be great.
Here is the result of nvidia-smi topo -m
in our machine:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 mlx5_1 mlx5_2 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB PXB 0-27,56-83 0
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB PXB PXB 0-27,56-83 0
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 NODE NODE NODE 0-27,56-83 0
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 NODE NODE NODE 0-27,56-83 0
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS 28-55,84-111 1
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS 28-55,84-111 1
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS 28-55,84-111 1
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS 28-55,84-111 1
mlx5_0 PXB PXB NODE NODE SYS SYS SYS SYS X PIX PIX
mlx5_1 PXB PXB NODE NODE SYS SYS SYS SYS PIX X PIX
mlx5_2 PXB PXB NODE NODE SYS SYS SYS SYS PIX PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
from msccl.
There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.
Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
Cool! We can simply modify CrossNodeGpus to just use local rank 0 to implement the algo you mentioned?
By the way, what's the difference between alltoall two step and three step? Two step: (1) gather cross-node traffic to a local GPU and (2) then the local GPU sends cross-node data to its local GPU peers on other nodes. So in two step we don't group chunks into a larger one?
from msccl.
Many thanks for your help, we will use the NPKit profiler to get more information. BTW, after finding a way to improve all-to-all performance, we want to improve all-reduce latency at the same scale, so I think NPKit will help further.
from msccl.
In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try! Also, try NPKit profiler for further analysis of your algorithm!
Hi saeedmaleki, we have tried to modify the 3-step algo in this way:
But the nccl-tests result does not seem to behave better than NCCL. We also profiled the algo with NPKIT. The trace result of NPKIT is hard to understand. Could you please have a look at this result? We tested the result on 2 nodes (16 GPUs). Thank you!
trace.zip
from msccl.
NCCL gives you 2.15 but theoretically (based purely on bandwidth) you should be able to get around 3.4 and the reason for this gap is because of the suboptimal algorithm NCCL is using for your topology. Can you please share your 64xGPU results for a large range from 1KB to 4GB (or however large it can run)?
Here is the result of nccl-tests (using original NCCL) on 64xGPU for a large range from 1KB to 4GB:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 4 float none -1 3343.1 0.00 0.00 0 560.6 0.00 0.00 N/A
2048 8 float none -1 165.6 0.01 0.01 0 161.9 0.01 0.01 N/A
4096 16 float none -1 160.8 0.03 0.03 0 162.2 0.03 0.02 N/A
8192 32 float none -1 160.9 0.05 0.05 0 161.3 0.05 0.05 N/A
16384 64 float none -1 160.9 0.10 0.10 0 160.3 0.10 0.10 N/A
32768 128 float none -1 161.1 0.20 0.20 0 161.2 0.20 0.20 N/A
65536 256 float none -1 173.3 0.38 0.37 0 181.3 0.36 0.36 N/A
131072 512 float none -1 211.3 0.62 0.61 0 209.6 0.63 0.62 N/A
262144 1024 float none -1 243.4 1.08 1.06 0 242.2 1.08 1.07 N/A
524288 2048 float none -1 416.9 1.26 1.24 0 416.9 1.26 1.24 N/A
1048576 4096 float none -1 821.2 1.28 1.26 0 815.6 1.29 1.27 N/A
2097152 8192 float none -1 1601.1 1.31 1.29 0 1668.1 1.26 1.24 N/A
4194304 16384 float none -1 3208.9 1.31 1.29 0 3200.8 1.31 1.29 N/A
8388608 32768 float none -1 6545.6 1.28 1.26 0 6629.6 1.27 1.25 N/A
16777216 65536 float none -1 12890 1.30 1.28 0 12827 1.31 1.29 N/A
33554432 131072 float none -1 25914 1.29 1.27 0 25847 1.30 1.28 N/A
67108864 262144 float none -1 52074 1.29 1.27 0 52211 1.29 1.27 N/A
134217728 524288 float none -1 103897 1.29 1.27 0 103746 1.29 1.27 N/A
268435456 1048576 float none -1 240116 1.12 1.10 0 206585 1.30 1.28 N/A
536870912 2097152 float none -1 413198 1.30 1.28 0 475190 1.13 1.11 N/A
1073741824 4194304 float none -1 823209 1.30 1.28 0 823260 1.30 1.28 N/A
2147483648 8388608 float none -1 1647891 1.30 1.28 0 1646527 1.30 1.28 N/A
4294967296 16777216 float none -1 3295211 1.30 1.28 0 3299375 1.30 1.28 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.872544
from msccl.
In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try! Also, try NPKit profiler for further analysis of your algorithm!
Hi saeedmaleki, we have tried to modify the 3-step algo in this way:
But the nccl-test result seems not to behave better than NCCL. We also profiled the algo with NPKIT. The trace result of NPKIT is hard to understand. Could you please have a look at this result? We tested the result on 2 nodes(16 GPUs). Thank you! trace.zip
This looks like a good algorithm. I don't expect it to do much differently for two nodes. But I believe you should see better results for larger scale.
Also, the trace file seems to be only for the default NCCL algorithm. What size did you run to get this trace file?
from msccl.
NCCL gives you 2.15 but theoretically (based purely on bandwidth) you should be able to get around 3.4 and the reason for this gap is because of the suboptimal algorithm NCCL is using for your topology. Can you please share your 64xGPU results for a large range from 1KB to 4GB (or however large it can run)?
Here is the result of nccl-tests(using origin NCCL) on 64xGPU for a large range from 1KB to 4GB:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 4 float none -1 3343.1 0.00 0.00 0 560.6 0.00 0.00 N/A
2048 8 float none -1 165.6 0.01 0.01 0 161.9 0.01 0.01 N/A
4096 16 float none -1 160.8 0.03 0.03 0 162.2 0.03 0.02 N/A
8192 32 float none -1 160.9 0.05 0.05 0 161.3 0.05 0.05 N/A
16384 64 float none -1 160.9 0.10 0.10 0 160.3 0.10 0.10 N/A
32768 128 float none -1 161.1 0.20 0.20 0 161.2 0.20 0.20 N/A
65536 256 float none -1 173.3 0.38 0.37 0 181.3 0.36 0.36 N/A
131072 512 float none -1 211.3 0.62 0.61 0 209.6 0.63 0.62 N/A
262144 1024 float none -1 243.4 1.08 1.06 0 242.2 1.08 1.07 N/A
524288 2048 float none -1 416.9 1.26 1.24 0 416.9 1.26 1.24 N/A
1048576 4096 float none -1 821.2 1.28 1.26 0 815.6 1.29 1.27 N/A
2097152 8192 float none -1 1601.1 1.31 1.29 0 1668.1 1.26 1.24 N/A
4194304 16384 float none -1 3208.9 1.31 1.29 0 3200.8 1.31 1.29 N/A
8388608 32768 float none -1 6545.6 1.28 1.26 0 6629.6 1.27 1.25 N/A
16777216 65536 float none -1 12890 1.30 1.28 0 12827 1.31 1.29 N/A
33554432 131072 float none -1 25914 1.29 1.27 0 25847 1.30 1.28 N/A
67108864 262144 float none -1 52074 1.29 1.27 0 52211 1.29 1.27 N/A
134217728 524288 float none -1 103897 1.29 1.27 0 103746 1.29 1.27 N/A
268435456 1048576 float none -1 240116 1.12 1.10 0 206585 1.30 1.28 N/A
536870912 2097152 float none -1 413198 1.30 1.28 0 475190 1.13 1.11 N/A
1073741824 4194304 float none -1 823209 1.30 1.28 0 823260 1.30 1.28 N/A
2147483648 8388608 float none -1 1647891 1.30 1.28 0 1646527 1.30 1.28 N/A
4294967296 16777216 float none -1 3295211 1.30 1.28 0 3299375 1.30 1.28 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0.872544
These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of the IB, we should be seeing around 3.21GBps, and that means 2.5x of the performance is missing.
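The arithmetic behind that 2.5x, written out (numbers taken from this thread, plain Python):

```python
theoretical = 25 / 7           # ~3.57 GBps algBW ceiling for 8 nodes
realistic = theoretical * 0.9  # ~3.21 GBps at 90% IB efficiency
print(realistic / 1.30)        # ~2.5x gap versus the measured 1.30 GBps
```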
Can you please run AllGather on two nodes on a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.
from msccl.
There is also an inconsistency with 1.30GBps algBW at 4GB input size on 64xGPUs. Earlier you got >2GBps for the same setting as I can see from the logs you shared. Right?
from msccl.
This looks like a good algorithm. I don't expect it to do much differently for two nodes. But I believe you should see better results for larger scale.
Also, the trace file seems to be only for the default NCCL algorithm. What size did you run to get this trace file?
Well, we did use the modified 3-step MSCCL algo because "Connected 1 MSCCL algorithm" was in the log. The size is only 256MiB.
from msccl.
This looks like a good algorithm. I don't expect it to do much differently for two nodes. But I believe you should see better results for larger scale.
Also, the trace file seems to be only for the default NCCL algorithm. What size did you run to get this trace file?
Well, we did use the modified 3-step MSCCL algo because "Connected 1 MSCCL algorithm" was in the log. The size is only 256MiB.
Then everything should have worked just fine. Can you please share your 3-step algorithm so that I can check it from my side? There might be some bug in MSCCL which we need to fix.
from msccl.
These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of IB, we should be seeing around 3.21GBps and that is a 2.5x missing performance.
Can you please run AllGather on two nodes on a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.
Here is the all-gather result of 2 nodes on a long range of sizes:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 60.47 0.02 0.02 0 36.20 0.03 0.03 0
2048 32 float none -1 36.32 0.06 0.05 0 36.16 0.06 0.05 0
4096 64 float none -1 36.12 0.11 0.11 0 36.24 0.11 0.11 0
8192 128 float none -1 36.64 0.22 0.21 0 36.48 0.22 0.21 0
16384 256 float none -1 37.94 0.43 0.40 0 37.50 0.44 0.41 0
32768 512 float none -1 40.81 0.80 0.75 0 40.58 0.81 0.76 0
65536 1024 float none -1 48.38 1.35 1.27 0 48.37 1.35 1.27 0
131072 2048 float none -1 55.21 2.37 2.23 0 50.25 2.61 2.45 0
262144 4096 float none -1 137.2 1.91 1.79 0 137.3 1.91 1.79 0
524288 8192 float none -1 148.1 3.54 3.32 0 146.8 3.57 3.35 0
1048576 16384 float none -1 171.5 6.11 5.73 0 168.0 6.24 5.85 0
2097152 32768 float none -1 204.1 10.27 9.63 0 203.8 10.29 9.65 0
4194304 65536 float none -1 300.7 13.95 13.07 0 296.9 14.12 13.24 0
8388608 131072 float none -1 489.4 17.14 16.07 0 488.3 17.18 16.10 0
16777216 262144 float none -1 971.2 17.27 16.20 0 965.4 17.38 16.29 0
33554432 524288 float none -1 2006.8 16.72 15.68 0 1962.3 17.10 16.03 0
67108864 1048576 float none -1 3956.8 16.96 15.90 0 3924.3 17.10 16.03 0
134217728 2097152 float none -1 7882.9 17.03 15.96 0 7843.1 17.11 16.04 0
268435456 4194304 float none -1 15505 17.31 16.23 0 15366 17.47 16.38 0
536870912 8388608 float none -1 31401 17.10 16.03 0 31144 17.24 16.16 0
1073741824 16777216 float none -1 61484 17.46 16.37 0 59921 17.92 16.80 0
2147483648 33554432 float none -1 120583 17.81 16.70 0 119265 18.01 16.88 0
4294967296 67108864 float none -1 240501 17.86 16.74 0 235906 18.21 17.07 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.76973
from msccl.
These numbers are a bit unexpected. Are you sure that your IB is 200Gbps? With 200Gbps, theoretically, you should be seeing 3.57GBps algBW. If we consider 90% efficiency of IB, we should be seeing around 3.21GBps and that is a 2.5x missing performance.
Can you please run AllGather on two nodes on a long range of sizes and share the numbers? That way we can be sure about the BW of the IB.
Here is the all-gather result of 2 nodes on a long range of sizes:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1024 16 float none -1 60.47 0.02 0.02 0 36.20 0.03 0.03 0
2048 32 float none -1 36.32 0.06 0.05 0 36.16 0.06 0.05 0
4096 64 float none -1 36.12 0.11 0.11 0 36.24 0.11 0.11 0
8192 128 float none -1 36.64 0.22 0.21 0 36.48 0.22 0.21 0
16384 256 float none -1 37.94 0.43 0.40 0 37.50 0.44 0.41 0
32768 512 float none -1 40.81 0.80 0.75 0 40.58 0.81 0.76 0
65536 1024 float none -1 48.38 1.35 1.27 0 48.37 1.35 1.27 0
131072 2048 float none -1 55.21 2.37 2.23 0 50.25 2.61 2.45 0
262144 4096 float none -1 137.2 1.91 1.79 0 137.3 1.91 1.79 0
524288 8192 float none -1 148.1 3.54 3.32 0 146.8 3.57 3.35 0
1048576 16384 float none -1 171.5 6.11 5.73 0 168.0 6.24 5.85 0
2097152 32768 float none -1 204.1 10.27 9.63 0 203.8 10.29 9.65 0
4194304 65536 float none -1 300.7 13.95 13.07 0 296.9 14.12 13.24 0
8388608 131072 float none -1 489.4 17.14 16.07 0 488.3 17.18 16.10 0
16777216 262144 float none -1 971.2 17.27 16.20 0 965.4 17.38 16.29 0
33554432 524288 float none -1 2006.8 16.72 15.68 0 1962.3 17.10 16.03 0
67108864 1048576 float none -1 3956.8 16.96 15.90 0 3924.3 17.10 16.03 0
134217728 2097152 float none -1 7882.9 17.03 15.96 0 7843.1 17.11 16.04 0
268435456 4194304 float none -1 15505 17.31 16.23 0 15366 17.47 16.38 0
536870912 8388608 float none -1 31401 17.10 16.03 0 31144 17.24 16.16 0
1073741824 16777216 float none -1 61484 17.46 16.37 0 59921 17.92 16.80 0
2147483648 33554432 float none -1 120583 17.81 16.70 0 119265 18.01 16.88 0
4294967296 67108864 float none -1 240501 17.86 16.74 0 235906 18.21 17.07 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.76973
OK great! This is >12.5GBps, which means that your IB's bw is close to 25GBps (or 200Gbps). Therefore, your AllToAll can be hugely improved by utilizing the right algorithm. Please share your 3-step algorithm so that I can take a look :).
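As a sanity check on the table above, nccl-tests converts allgather algbw to busbw with a factor of (n-1)/n (see nccl-tests' performance notes); at the largest size with 16 ranks:

```python
def allgather_busbw(algbw_gbps, nranks):
    # nccl-tests busbw approximates the per-link traffic of a ring allgather.
    return algbw_gbps * (nranks - 1) / nranks

print(allgather_busbw(17.86, 16))  # ~16.74, matching the busbw column
```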
from msccl.
OK great! This is >12.5GBps which means that your IB's bw is close to 25GBps (or 200Gbps). Therefore, your AllToAll can be hugely improved by utilizing the right algorithm. Please share you 3-step algorithm so that I can take a look :).
Okay. Thanks!
alltoall_a100_three_step.py.zip
from msccl.
There is also an inconsistency with 1.30GBps algBW at 4GB input size on 64xGPUs. Earlier you got >2GBps for the same setting as I can see from the logs you shared. Right?
It could have some fluctuations, because we tested in a cluster and the machines were not fixed. BTW, the IBs are not all 200Gbps; each node has 3 IBs. Here is the result of ibstatus on a node:
from msccl.
There is also an inconsistency with 1.30GBps algBW at 4GB input size on 64xGPUs. Earlier you got >2GBps for the same setting as I can see from the logs you shared. Right?
It could have some fluctuations, because we tested in a cluster and the machines were not fixed. BTW, the IBs are not all 200Gbps; each node has 3 IBs. Here is the result of ibstatus on a node:
You need to look at the NCCL log (NCCL_DEBUG=INFO) to see which one it uses. My guess is that it will only use the fastest one, since I think (could be wrong here) your PCIe bandwidth can draw at most 200Gbps.
from msccl.
Therefore, your AllToAll can be hugely improved by utilizing the right algorithm.
Wow, great news. Could you give some hints on it? We want to use small chunks on the IB to make it faster, but it seems msccl currently uses fixed chunks in all2all?
from msccl.
@saeedmaleki long time no see!
Could you give some hints on it? We really need to improve all2all latency; currently it consumes 78% of the time of one MoE layer.
from msccl.
Hi @Jack47, I took a look at your algorithm and it seems that the local node communication has too many steps which can be optimized. To evaluate this, I would suggest two things:
(1) try to get NPKit to generate the json files and generate a trace so that we can study where your performance is going. I am not entirely sure why you couldn't get it to work; if there is a bug, please let us know. You could first look at 2-node alltoall with MSCCL and see if that makes a difference. Note that NPKit generates really large log files, so please be mindful and do not run nccl-tests with too many iterations.
(2) trying all-to-all with one GPU per node on 8 nodes. You could use the default NCCL algorithm to study the performance. That gives us a good idea if the cross-node communication has any bottlenecks; a quick ceiling computation for this experiment is sketched below.
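Rough expected numbers for that 1-GPU-per-node run, assuming one 25GBps NIC per node and the nccl-tests alltoall busbw factor of (n-1)/n (my arithmetic, not a measurement):

```python
nic_gbps = 25.0
n = 8                               # ranks == nodes in this experiment
max_algbw = nic_gbps * n / (n - 1)  # every byte a rank sends crosses the IB
print(max_algbw)                    # ~28.6 GBps algbw ceiling
print(max_algbw * (n - 1) / n)      # 25.0 GBps busbw ceiling
```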
Do these steps make sense?
from msccl.
Hi @saeedmaleki, many thanks for your advice. Here is the two-node NPKit result: https://github.com/Musisoul/NPKIT-results/blob/main/trace.zip. We will try one GPU per node on 8 nodes.
from msccl.
It seems that this is still using the default NCCL algorithm. Did you try the algorithm you developed for this run?
from msccl.
It seems that this is still using the default NCCL algorithm. Did you try the algorithm you developed for this run?
Well, we did use the modified 3-step MSCCL algo because "Connected 1 MSCCL algorithm" was in the log. The size is only 256MiB.
Maybe we should try again, but the previous NPKIT result was generated by the modified 3-step algo. We will also try one GPU per node on 8 nodes. These tests may take some time.
from msccl.
Hi @Jack47, I took a look at your algorithm and it seems that the local node communication has too many steps which can be optimized. To evaluate this, I would suggest two things: (1) try to get NPKit to generate the json files and generate a trace so that we can study where your performance is going. I am not entirely sure why you couldn't get it to work, if there is a bug, please let us know. You could first look at 2-node alltoall with MSCCL and see if that makes a difference. Note that NPKit generates really large log files, so please be mindful and do not run nccl-tests with too many iterations. (2) trying all-to-all with one GPU per node on 8 nodes. You could use default NCCL's algorithm to study the performance. That gives us a good idea if the cross-node communication has any bottlenecks.
Do these steps make sense?
Sorry for not replying for a long time. We have tried the all-to-all test with one GPU per node on 8 nodes. Here is the result:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 32768 float none -1 2522.4 0.42 0.36 0 149.6 7.01 6.13 N/A
2097152 65536 float none -1 249.5 8.40 7.35 0 248.3 8.45 7.39 N/A
4194304 131072 float none -1 453.7 9.24 8.09 0 427.6 9.81 8.58 N/A
8388608 262144 float none -1 790.4 10.61 9.29 0 1273.5 6.59 5.76 N/A
16777216 524288 float none -1 1511.7 11.10 9.71 0 1597.4 10.50 9.19 N/A
33554432 1048576 float none -1 3115.8 10.77 9.42 0 3281.4 10.23 8.95 N/A
67108864 2097152 float none -1 6121.0 10.96 9.59 0 6137.1 10.93 9.57 N/A
134217728 4194304 float none -1 12126 11.07 9.69 0 12267 10.94 9.57 N/A
268435456 8388608 float none -1 24781 10.83 9.48 0 23934 11.22 9.81 N/A
536870912 16777216 float none -1 48064 11.17 9.77 0 47966 11.19 9.79 N/A
1073741824 33554432 float none -1 95345 11.26 9.85 0 95303 11.27 9.86 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.5103
The log is attached below:
test-8node8gpu.log
We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip
We found that when using tools/npkit_trace_generator.py to generate json from npkit_dump_dir, there were some bugs in the python script:
Traceback (most recent call last):
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 232, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 117, in parse_gpu_event_file
gpu_events[-1]['args']['bw (GB/s)'] = gpu_events[-1]['args']['size'] / delta_time / 1e3
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 233, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 212, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 116, in parse_gpu_event_file
delta_time = gpu_events[-1]['ts'] - gpu_events[-2]['ts']
IndexError: list index out of range
Traceback (most recent call last):
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 235, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 214, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
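For reference, guards of roughly this shape avoid the first two crashes (a sketch inferred from the tracebacks; gpu_events and the field names come from the script, the rest is my assumption):

```python
def annotate_bw(gpu_events):
    """Attach a bandwidth figure to the latest GPU event, skipping the
    degenerate cases that crashed the original script."""
    if len(gpu_events) < 2:
        return  # avoids the IndexError on the first event
    delta_time = gpu_events[-1]['ts'] - gpu_events[-2]['ts']
    if delta_time <= 0:
        return  # avoids the ZeroDivisionError on identical timestamps
    gpu_events[-1]['args']['bw (GB/s)'] = (
        gpu_events[-1]['args']['size'] / delta_time / 1e3)
```

The third crash (curr_cpu_base_time being None) suggests a missing CPU base time for some rank, which likely needs a similar early return.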
We circumvented these bugs and generated the json. Previously you said "it seems that this is still using the default NCCL algorithm"; will these bugs affect the results? Thank you!
from msccl.
OK great! This is >12.5GBps which means that your IB's bw is close to 25GBps (or 200Gbps). Therefore, your AllToAll can be hugely improved by utilizing the right algorithm. Please share you 3-step algorithm so that I can take a look :).
Okay. Thanks! alltoall_a100_three_step.py.zip
Previously we provided you with the modified three-step algo; what is its performance on your machines? Does this code need to be improved, or does it have some bugs? We are looking forward to your reply. Thank you!
from msccl.
We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip
Update: this file could be too large for the trace viewer to open; we retested the modified three-step algo with fewer iterations and got this result:
npkit_event_trace_20230108.zip
from msccl.
Hi @Jack47, I took a look at your algorithm and it seems that the local node communication has too many steps which can be optimized. To evaluate this, I would suggest two things: (1) try to get NPKit to generate the json files and generate a trace so that we can study where your performance is going. I am not entirely sure why you couldn't get it to work, if there is a bug, please let us know. You could first look at 2-node alltoall with MSCCL and see if that makes a difference. Note that NPKit generates really large log files, so please be mindful and do not run nccl-tests with too many iterations. (2) trying all-to-all with one GPU per node on 8 nodes. You could use default NCCL's algorithm to study the performance. That gives us a good idea if the cross-node communication has any bottlenecks.
Do these steps make sense?
Sorry for not replying for a long time. We have tried the all-to-all test with one GPU per node on 8 nodes. Here is the result:
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
1048576 32768 float none -1 2522.4 0.42 0.36 0 149.6 7.01 6.13 N/A
2097152 65536 float none -1 249.5 8.40 7.35 0 248.3 8.45 7.39 N/A
4194304 131072 float none -1 453.7 9.24 8.09 0 427.6 9.81 8.58 N/A
8388608 262144 float none -1 790.4 10.61 9.29 0 1273.5 6.59 5.76 N/A
16777216 524288 float none -1 1511.7 11.10 9.71 0 1597.4 10.50 9.19 N/A
33554432 1048576 float none -1 3115.8 10.77 9.42 0 3281.4 10.23 8.95 N/A
67108864 2097152 float none -1 6121.0 10.96 9.59 0 6137.1 10.93 9.57 N/A
134217728 4194304 float none -1 12126 11.07 9.69 0 12267 10.94 9.57 N/A
268435456 8388608 float none -1 24781 10.83 9.48 0 23934 11.22 9.81 N/A
536870912 16777216 float none -1 48064 11.17 9.77 0 47966 11.19 9.79 N/A
1073741824 33554432 float none -1 95345 11.26 9.85 0 95303 11.27 9.86 N/A
# Out of bounds values : 0 OK
# Avg bus bandwidth : 8.5103
The log is attached below: test-8node8gpu.log
We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result: npkit_event_trace_20230107.zip
We found that when using tools/npkit_trace_generator.py to generate json from npkit_dump_dir, there were some bugs in the python script:
Traceback (most recent call last):
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 232, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 211, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 117, in parse_gpu_event_file
gpu_events[-1]['args']['bw (GB/s)'] = gpu_events[-1]['args']['size'] / delta_time / 1e3
ZeroDivisionError: float division by zero
Traceback (most recent call last):
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 233, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 212, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 116, in parse_gpu_event_file
delta_time = gpu_events[-1]['ts'] - gpu_events[-2]['ts']
IndexError: list index out of range
Traceback (most recent call last):
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 235, in <module>
convert_npkit_dump_to_trace(args.input_dir, args.output_dir, npkit_event_def)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 214, in convert_npkit_dump_to_trace
gpu_events = parse_gpu_event_file(npkit_dump_dir, npkit_event_def, rank, buf_idx, gpu_clock_scale, cpu_clock_scale)
File "/mnt/lustre/chendingyu1/msccl-master-npkit/tools/npkit_trace_generator.py", line 95, in parse_gpu_event_file
'ts': curr_cpu_base_time + parsed_gpu_event['timestamp'] / gpu_clock_scale - curr_gpu_base_time,
TypeError: unsupported operand type(s) for +: 'NoneType' and 'float'
We circumvented these bugs and generated the json. Previously you said "it seems that this is still using the default NCCL algorithm"; will these bugs affect the results? Thank you!
This is your key result pinpointing the problem. You are getting 9.86 GBps busBW, which is a good 2x off. It should have been ~22-25 GBps, which is your IB's BW. You might have a bad node in the system. I suggest narrowing down the experiment to 2 and 4 GPUs per experiment to find the problematic node. As far as I remember your AllGather result had great numbers, so it seems like something is off.
Without fixing this issue, the maximum busBW you may get on 64 GPUs is 9.86/8 ≈ 1.23 GBps, which is way below what it needs to be.
Please let me know of your investigation and we can find the problematic node.
from msccl.
We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip
Update: this file could be too large for the trace viewer to open; we retested the modified three-step algo with fewer iterations and got this result: npkit_event_trace_20230108.zip
I couldn't open the zip file; could you please reupload it?
from msccl.
We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip
Update: this file could be too large for the trace viewer to open; we retested the modified three-step algo with fewer iterations and got this result: npkit_event_trace_20230108.zip
I couldn't open the zip file; could you please reupload it?
I can unzip the npkit_event_trace_20230108.zip, so should I upload the json directly?
from msccl.
#48 (comment)
When I use the forked repo and run
"make MPI=1 NCCL_HOME=../msccl/build/ -j",
why do I get the error
'alltoall.cu(72): error: identifier "ncclAllToAll" is undefined'?
It seems nccl-tests doesn't refer to msccl correctly.
from msccl.
We have tried the NPKIT on 2 nodes using our modified three-step algo. Here is the result:
npkit_event_trace_20230107.zip
Update: this file could be too large for the trace viewer to open; we retested the modified three-step algo with fewer iterations and got this result: npkit_event_trace_20230108.zip
I couldn't open the zip file; could you please reupload it?
I can unzip the npkit_event_trace_20230108.zip, so should I upload the json directly?
Oops sorry I dropped the ball on this one. Yes please
from msccl.
#48 (comment) When I use the forked repo and run "make MPI=1 NCCL_HOME=../msccl/build/ -j", why do I get the error 'alltoall.cu(72): error: identifier "ncclAllToAll" is undefined'?
It seems nccl-tests doesn't refer to msccl correctly.
I suggest giving it an absolute path instead of a relative one.
from msccl.
There is the answer, it's connected to GPU 0 and 1 over a PCIe switch.
Regarding the proper algorithm for your use case: you might wanna use something like the 3-step algorithm for alltoall except that the relay logic needs to change. This means that we need to (1) gather cross-node traffic to a local GPU (let's say local GPU 0) and (2) then GPU 0 sends cross-node data to local GPU 0 on other nodes. (3) Lastly, GPU 0 scatters the data to everyone. Thus, the name 3-step algorithm.
Cool! We can simply modify CrossNodeGpus to just use local rank 0 to implement the algo you mentioned?
Exactly! BTW, I am not sure if it will work better for you; I am just judging from my intuition.
By the way, what's the difference between alltoall two step and three step? Two step: (1) gather cross-node traffic to a local GPU and (2) then the local GPU sends cross-node data to its local GPU peers on other nodes. So in two step we don't group chunks into a larger one?
So, in the two-step algorithm, all of the cross-node traffic, for example, to local GPU 6 on node 0 goes via local GPU 6 on other nodes. So, local GPU i on node j aggregates the traffic from other local GPUs on node j that needs to go to local GPU i on other nodes.
Hi, I see that since NCCL 2.12, the default NCCL primitives have used PXN to optimize cross-node traffic. PXN also aggregates cross-node messages on the same local GPU and then sends them over shared connections. So, is there any difference between the MSCCL two-step algorithm and the PXN optimization in NCCL?
In the three-step algorithm, the aggregation is more aggressive, and it was only beneficial in our experiments at 64-node scale and larger. In this algorithm, node i and node j communicate via only one local GPU on each (the CrossNodeGpus logic). In your topology, because you have only a single IB, the three-step algorithm with a more aggressive aggregation would make more sense. We will never know until we try!
Also, try NPKit profiler for further analysis of your algorithm!
from msccl.
From our experience, there are still differences at 64 GPUs and up, but PXN does a pretty good job as well. After 1024 GPUs, though, you will need to switch to the 3-step algorithm. I think PXN is disabled after a certain number of GPUs.
from msccl.
From our experience, there are still differences at 64 GPUs and up, but PXN does a pretty good job as well. After 1024 GPUs, though, you will need to switch to the 3-step algorithm. I think PXN is disabled after a certain number of GPUs.
If I understand correctly, in the 3-step algorithm, only one GPU's NIC will be used to exchange cross-node traffic. If one node has 8 GPUs and 8 NICs, the available bandwidth will be decreased to 1/8. Could you share more of your experience on why the 3-step algorithm improves performance after 1024 GPUs?
from msccl.
From our experience, there are still differences at 64 GPUs and up, but PXN does a pretty good job as well. After 1024 GPUs, though, you will need to switch to the 3-step algorithm. I think PXN is disabled after a certain number of GPUs.
Is there any experiment showing that the 3-step algorithm performs better than PXN at a certain number of GPUs or message size? In my experiment with 96 GPUs (12 nodes * 8 GPUs), the result shows that the 3-step algorithm performs much worse than the native alltoall implemented with send/recv.
Native alltoall with PXN:
3-step algorithm with PXN disabled:
3-step algorithm MSCCL_XML_FILES:
from msccl.