Comments (6)
Thanks again for the patch, Dmitry, and we apologize for overriding the previous changes. I have reviewed the patch and have one critical piece of feedback.
The major blocker I see in testing the psm3 provider is access to appropriate hardware. Does your team have a continuous integration system that our PRs can use? This would help us prevent such issues in the future.
from aws-ofi-nccl.
Hello Rashika (@rashikakheria),
May I ask whether your test/CI system has a RoCE or IB (InfiniBand) network card?
If so, PSM3 can be tested there out of the box; it does not require Intel NICs.
PSM3 is fully open source and ships with libfabric.
Access to our team's test system is a tough topic, so I would like to discuss easier approaches first.
BRs,
Denis
May I ask whether your test/CI system has a RoCE or IB (InfiniBand) network card?
No, we don't have access to systems with IB or RoCE network cards.
PR merged.
Hello Rashika (@rashikakheria),
Sorry for the late reply. Unfortunately, external access to our team's internal CI infrastructure is a topic that requires thorough discussion (primarily with infosec representatives). We have initiated that discussion, but the current answer is that we are not allowed to do this. I will update you if something changes, but the chances are low.
Thanks for getting back! One idea is that you could pull in the changes from the aws-ofi-nccl repository for every PR and test them on your end. You could comment on the PRs if your CI system breaks (until this repository has direct access to your testing).