Comments (2)
When multiple Ethernet interfaces are available, there is no "default" one: OMPI automatically detects the usable interfaces with its routability algorithm. The detection is complex and still cannot cover some cases, which require the user to specify btl_tcp_if_include or btl_tcp_if_exclude explicitly.
You can pass such parameters through in the config YAML:
modes:
  - name: mpi
    mca:
      btl_tcp_if_include: azure2
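As a rough illustration of what such a setting amounts to, the sketch below flattens an mca mapping into `--mca <key> <value>` arguments for mpirun. The helper name `mca_to_mpirun_args` is hypothetical and this is not SuperBench's actual implementation; only the `--mca` flag itself belongs to Open MPI.

```python
def mca_to_mpirun_args(mca):
    """Flatten an MCA parameter mapping into mpirun command-line arguments."""
    args = []
    for key, value in mca.items():
        # Open MPI accepts each MCA parameter as: --mca <key> <value>
        args += ['--mca', str(key), str(value)]
    return args


# Mirrors the config fragment above; 'azure2' is the interface name from this issue.
mca = {'btl_tcp_if_include': 'azure2'}
cmd = ['mpirun'] + mca_to_mpirun_args(mca)
print(' '.join(cmd))
```

The same mapping could carry btl_tcp_if_exclude instead when it is easier to name the interfaces to avoid.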
The proposed get_eth_interfaces detection cannot cover some scenarios either, and may not be better than OMPI's current detection. Here are two cases:
- "MPI job that spans public and private networks", which OMPI cannot cover. Consider the following two nodes:
| Node A        |     | Node B        |
|               |     |               |
| eth0          |     | eth0          |  # public network, cannot route to each other
| 192.168.100.1 |     | 192.168.100.2 |
|               |     |               |
| eth1          |     | eth1          |  # private network, can route to each other
| 192.168.200.1 |     | 192.168.200.2 |
eth0 is only used to access the Internet, and the eth0 interfaces of the two nodes cannot route to each other, while eth1 on both nodes is connected to the same switch in a private network. Your detection method would consider both of them usable (IPv4, has an IP, not a bridge, not Docker), so it is hard to choose the correct one without extra knowledge of the cluster networking.
- When all Ethernet interfaces (azure2 and eth0 in your case) are disabled or unavailable, IPoIB can also be used for MPI. For example, changing to btl_tcp_if_include ib0 should also work in your environment. The get_eth_interfaces detection would also exclude such suitable interfaces.
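The two cases above can be sketched with a purely local filter. This is a minimal illustration, not the proposed get_eth_interfaces code: the `Iface` type, the `looks_usable` and `same_subnet` helpers, the /24 prefix lengths, and the ib0 address are all assumptions made for the example. The hardware type constants are the Linux values for Ethernet and InfiniBand.

```python
from dataclasses import dataclass
from ipaddress import ip_interface

ARPHRD_ETHER = 1        # Linux hardware type for Ethernet
ARPHRD_INFINIBAND = 32  # Linux hardware type for InfiniBand (IPoIB)


@dataclass
class Iface:
    name: str
    hw_type: int
    addr: str  # IPv4 address with prefix length (the /24 is assumed)
    is_bridge: bool = False
    is_docker: bool = False


def looks_usable(iface):
    """Hypothetical local check: Ethernet, IPv4, not a bridge, not Docker."""
    return (
        iface.hw_type == ARPHRD_ETHER
        and ip_interface(iface.addr).version == 4
        and not iface.is_bridge
        and not iface.is_docker
    )


def same_subnet(a, b):
    """True when both addresses fall in the same IP network."""
    return ip_interface(a.addr).network == ip_interface(b.addr).network


# Case 1: both eth0 and eth1 pass every local check, and even a subnet
# comparison cannot tell that the eth0 pair is not actually routable.
node_a = [Iface('eth0', ARPHRD_ETHER, '192.168.100.1/24'),
          Iface('eth1', ARPHRD_ETHER, '192.168.200.1/24')]
node_b = [Iface('eth0', ARPHRD_ETHER, '192.168.100.2/24'),
          Iface('eth1', ARPHRD_ETHER, '192.168.200.2/24')]
candidates = [a.name for a, b in zip(node_a, node_b)
              if looks_usable(a) and looks_usable(b) and same_subnet(a, b)]
print(candidates)  # ['eth0', 'eth1']

# Case 2: an Ethernet-only filter drops ib0 even though IPoIB would work.
ib0 = Iface('ib0', ARPHRD_INFINIBAND, '10.0.0.1/24')  # address is made up
print(looks_usable(ib0))  # False
```

In both cases the missing piece is knowledge about the cluster network, which is why an explicit btl_tcp_if_include or btl_tcp_if_exclude setting remains the fallback.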
from superbenchmark.
will use one by setting the config file