Comments (5)
yaroslavvb@ Thanks for summarizing your efforts. Have you tried measuring busbw with nccl-tests while scaling the number of nodes? If so, could you please post some numbers here, along with the command line you used to invoke the tests?
from aws-ofi-nccl.
Links to the nccl-tests logs and commands are summarized on this page:
NVIDIA/nccl#235 (comment)
Note that nccl-tests reports busbw incorrectly for 2 machines, so I rely on algbw instead and use the formula in the doc above to convert between busbw and algbw.
For instance, copying from the 32-machine run log page:
Running command:
/opt/amazon/efa/bin/mpirun -n 256 -N 8 --hostfile hosts.slots -x FI_EFA_MR_CACHE_ENABLE="1" -x FI_OFI_RXR_INLINE_MR_ENABLE="1" -x FI_OFI_RXR_RX_COPY_OOO="1" -x FI_OFI_RXR_RX_COPY_UNEXP="1" -x FI_PROVIDER="efa" -x LD_LIBRARY_PATH="/usr/local/cuda/lib:/usr/local/cuda/lib64:/opt/amazon/efa/lib64" -x NCCL_DEBUG="INFO" -x NCCL_TREE_THRESHOLD="42949672960" --mca btl tcp,self --mca btl_tcp_if_exclude lo,docker0 -mca orte_base_help_aggregate 0 --bind-to none $HOME/packages/nccl-tests/build/all_reduce_perf -b 8 -e 4096M -f 2 -g 1 -c 1 -n 100
#
#                                                     out-of-place                        in-place
#       size         count    type   redop      time   algbw   busbw   error      time   algbw   busbw   error
#        (B)    (elements)                      (us)  (GB/s)  (GB/s)              (us)  (GB/s)  (GB/s)
ip-172-31-43-246:32158:32255 [6] NCCL INFO comm 0x7f7950002590 rank 86 nranks 256 cudaDev 6 nvmlDev 6 - Init COMPLETE
           8             2   float     sum     251.9    0.00    0.00   1e-06     247.8    0.00    0.00   1e-06
          16             4   float     sum     246.8    0.00    0.00   5e-07     246.9    0.00    0.00   5e-07
          32             8   float     sum     247.3    0.00    0.00   1e-06     249.0    0.00    0.00   1e-06
          64            16   float     sum     250.0    0.00    0.00   1e-06     249.3    0.00    0.00   1e-06
         128            32   float     sum     251.6    0.00    0.00   1e-06     250.6    0.00    0.00   1e-06
         256            64   float     sum     256.5    0.00    0.00   1e-06     255.7    0.00    0.00   1e-06
         512           128   float     sum     265.4    0.00    0.00   1e-06     264.0    0.00    0.00   1e-06
        1024           256   float     sum     273.1    0.00    0.01   1e-06     273.6    0.00    0.01   1e-06
        2048           512   float     sum     291.2    0.01    0.01   1e-06     291.5    0.01    0.01   1e-06
        4096          1024   float     sum     355.4    0.01    0.02   1e-06     353.9    0.01    0.02   1e-06
        8192          2048   float     sum     896.9    0.01    0.02   1e-06     828.2    0.01    0.02   1e-06
       16384          4096   float     sum    1222.0    0.01    0.03   1e-06    1065.7    0.02    0.03   1e-06
       32768          8192   float     sum    1197.6    0.03    0.05   1e-06    1092.5    0.03    0.06   1e-06
       65536         16384   float     sum    1308.3    0.05    0.10   1e-06    1159.2    0.06    0.11   1e-06
      131072         32768   float     sum    1388.7    0.09    0.19   2e-06    1190.2    0.11    0.22   2e-06
      262144         65536   float     sum    1357.0    0.19    0.38   2e-06    1356.2    0.19    0.39   2e-06
      524288        131072   float     sum    1620.1    0.32    0.64   2e-06    1612.9    0.33    0.65   2e-06
     1048576        262144   float     sum    2157.4    0.49    0.97   2e-06    2161.8    0.49    0.97   2e-06
     2097152        524288   float     sum    2146.4    0.98    1.95   2e-06    2147.8    0.98    1.95   2e-06
     4194304       1048576   float     sum    2928.8    1.43    2.85   2e-06    2937.5    1.43    2.84   2e-06
     8388608       2097152   float     sum    4641.1    1.81    3.60   2e-06    4636.5    1.81    3.60   2e-06
    16777216       4194304   float     sum    7001.4    2.40    4.77   2e-06    6991.1    2.40    4.78   2e-06
    33554432       8388608   float     sum     11609    2.89    5.76   2e-06     11626    2.89    5.75   2e-06
    67108864      16777216   float     sum     21008    3.19    6.36   2e-06     21036    3.19    6.36   2e-06
   134217728      33554432   float     sum     39490    3.40    6.77   2e-06     39524    3.40    6.77   2e-06
   268435456      67108864   float     sum     79617    3.37    6.72   2e-06     79562    3.37    6.72   2e-06
   536870912     134217728   float     sum    166922    3.22    6.41   2e-06    160034    3.35    6.68   2e-06
  1073741824     268435456   float     sum    311080    3.45    6.88   2e-06    311824    3.44    6.86   2e-06
  2147483648     536870912   float     sum    633049    3.39    6.76   2e-06    671957    3.20    6.37   2e-06
  4294967296    1073741824   float     sum   1226477    3.50    6.98   2e-06   1215340    3.53    7.04   2e-06
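As a sanity check on the algbw/busbw conversion mentioned above, here is a minimal Python sketch that recomputes both values for the last out-of-place row of this log. It assumes the all-reduce relation from the nccl-tests performance notes, busbw = algbw * 2 * (n - 1) / n, with n = 256 ranks and 1 GB = 1e9 bytes. The function name `allreduce_bandwidth` is made up for illustration and is not part of nccl-tests.

```python
def allreduce_bandwidth(size_bytes, time_us, nranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement.

    Assumes the all-reduce correction factor 2 * (n - 1) / n and
    decimal gigabytes (1 GB = 1e9 bytes).
    """
    # Algorithm bandwidth: message size divided by elapsed time.
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # Bus bandwidth: algbw scaled by the all-reduce factor.
    busbw = algbw * 2 * (nranks - 1) / nranks
    return algbw, busbw


# Last out-of-place row of the log: 4294967296 bytes in 1226477 us on 256 ranks.
algbw, busbw = allreduce_bandwidth(4294967296, 1226477, 256)
print(f"algbw = {algbw:.2f} GB/s, busbw = {busbw:.2f} GB/s")
# prints: algbw = 3.50 GB/s, busbw = 6.98 GB/s
```

With 256 ranks the factor is about 1.99, so busbw is very nearly twice algbw, which matches the reported columns.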
from aws-ofi-nccl.
Please re-open the issue if this still happens.
from aws-ofi-nccl.
Is this closed because the numbers reported above are what is expected on EFA?
from aws-ofi-nccl.
We expect a busbw of ~10 GB/s on AWS EC2 when the instances are launched within the same placement group.
from aws-ofi-nccl.