Comments (2)
The plugin code sets the topology file explicitly to ensure that the right topology files are used when running on AWS systems. If you build the plugin separately and copy the library to another location, there is a chance the topology file does not get copied along with it, which would lead to a performance regression. Similarly, overriding the NCCL_TOPO_FILE variable from customer configuration scripts might also lead to a performance regression if the file doesn't exist. These customer configuration issues are hard to chase when debugging performance. Is it possible to build the plugin in the conda environment?
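For anyone who does override NCCL_TOPO_FILE from their own scripts, a small pre-flight check can catch the missing-file case before a job starts. The following is only a minimal sketch: `check_topo_file` is a hypothetical helper, not part of aws-ofi-nccl or NCCL, and it only verifies that the path exists and is readable, not that the topology is the right one for the instance type.

```python
import os


def check_topo_file(environ=None):
    """Return (ok, message) for the NCCL_TOPO_FILE setting.

    Hypothetical helper for illustration; not part of aws-ofi-nccl or NCCL.
    """
    env = os.environ if environ is None else environ
    path = env.get("NCCL_TOPO_FILE")

    # Nothing was overridden, so there is no user-supplied path to validate.
    if not path:
        return True, "NCCL_TOPO_FILE is not set"

    # An override pointing at a missing file is the case described above.
    if not os.path.isfile(path):
        return False, f"NCCL_TOPO_FILE points to a missing file: {path}"

    if not os.access(path, os.R_OK):
        return False, f"NCCL_TOPO_FILE is not readable: {path}"

    return True, f"NCCL_TOPO_FILE found: {path}"


if __name__ == "__main__":
    ok, message = check_topo_file()
    print(message)
    raise SystemExit(0 if ok else 1)
```

Running this at the top of a launch script would make the job fail early with a clear message instead of showing up later as an unexplained slowdown.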
from aws-ofi-nccl.
#173 should fix the issue.
Related Issues (20)
- Crash with multirail providers. HOT 1
- [Question] Is RDMA available on p3dn instances? HOT 7
- Training performance on p3dn.24xlarge on Amazon Linux 2 is worse than on Ubuntu 20.04 (with and without EFA) HOT 2
- Error (and crash) when using EFA from docker running on ubuntu AMI HOT 2
- Question - difference between main vs aws branches HOT 1
- PyTorch Distributed Training crashes with "Cannot allocate memory (-12)" HOT 14
- How does ofi_iflush() work? HOT 6
- aws-ofi-nccl makes unnecessary calls to ofi_iflush() when using the PSM3 transport. HOT 2
- Error running NCCL Tests via MPiJob: "prov_err Not a Directory" HOT 2
- NCCL WARN NET/OFI Request completed with error. RC: 21. Error: unknown error HOT 2
- WARNING: unrecognized options: --with-nccl when attempting to install HOT 10
- Mellanox and EFA in Docker Image HOT 6
- NCCL WARN NET/OFI Only EFA provider is supported HOT 2
- potential reoccurrence of https://github.com/aws/aws-ofi-nccl/issues/69 HOT 1
- aws branch does not build on centos 7 with gcc 4.8.5 HOT 2
- Support Ubuntu 22.04 HOT 4
- Support FI_CONTEXT2 HOT 2
- Restore PSM3 transport for libfabric in .travis.yml HOT 4
- Missed ability to not use memory registering if provider does not request it HOT 6