Comments (9)
This absence is specific to aws-ofi-nccl's RDMA protocol implementation. Context was omitted only for convenience in our first implementation, and can be added in the future.
from aws-ofi-nccl.
Hi @rauteric,
Thank you for the clarification. Our team is considering contributing a fix to introduce FI_CONTEXT support with RDMA and would love to get your thoughts on any implementation preferences or directions your team might have. This would help us tailor our contribution to fit into the project.
Hi @eliekozah-cornelisnetworks,
Thank you for considering a contribution. In general, I think FI_CONTEXT support can be implemented similarly to how we implement it for the SENDRECV protocol (https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_sendrecv.c), by having a fi_context member of the request data structure.
The main obstacle to supporting FI_CONTEXT is that the context we currently use can be shared between multiple simultaneous Libfabric operations. The particular example I can find is RDMA write operations for individual rails (aws-ofi-nccl/src/nccl_ofi_rdma.c, lines 4226 to 4230 in 2adb8f5), which all pass req as their context. To support FI_CONTEXT, these would need to be split either into separate fi_context objects inside the request structure or, preferably, into subrequests of the parent request structure.
There may be other examples of this in the current code, but that's the only one I can find at the moment.
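As a rough illustration of the subrequest idea described above, here is a minimal sketch in C. It is not taken from the aws-ofi-nccl source: `parent_req`, `rail_subreq`, `MAX_RAILS`, and all helper functions are hypothetical names, and `struct fi_context` is declared locally as a stand-in for the definition that `<rdma/fabric.h>` normally provides, so that the snippet compiles without Libfabric. The point is only that each in-flight operation gets its own `fi_context`, and the owning request is recovered from the context pointer reported in the completion.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_RAILS 4 /* hypothetical upper bound on the number of rails */

/* Local stand-in for the struct normally provided by <rdma/fabric.h>. */
struct fi_context {
    void *internal[4];
};

struct parent_req; /* forward declaration */

/* Hypothetical subrequest: one per in-flight Libfabric operation. */
struct rail_subreq {
    struct fi_context ctx;     /* must not be shared while the op is in flight */
    struct parent_req *parent; /* back-pointer to the owning request */
    int rail_id;
};

/* Hypothetical parent request that tracks all per-rail operations. */
struct parent_req {
    int num_rails;
    int ncompls; /* completions seen so far */
    struct rail_subreq subreqs[MAX_RAILS];
};

static void parent_req_init(struct parent_req *req, int num_rails)
{
    assert(num_rails > 0 && num_rails <= MAX_RAILS);
    req->num_rails = num_rails;
    req->ncompls = 0;
    for (int i = 0; i < num_rails; i++) {
        req->subreqs[i].parent = req;
        req->subreqs[i].rail_id = i;
    }
}

/* Pointer to pass as the context argument of the operation on one rail. */
static void *rail_op_context(struct parent_req *req, int rail_id)
{
    assert(rail_id >= 0 && rail_id < req->num_rails);
    return &req->subreqs[rail_id].ctx;
}

/* Recover the subrequest from the context pointer of a completion entry. */
static struct rail_subreq *subreq_from_context(void *op_context)
{
    return (struct rail_subreq *)((char *)op_context -
                                  offsetof(struct rail_subreq, ctx));
}

/* Record one completion; returns 1 once every rail has completed. */
static int handle_completion(void *op_context)
{
    struct parent_req *req = subreq_from_context(op_context)->parent;
    req->ncompls++;
    return req->ncompls == req->num_rails;
}
```

With this layout, a write on rail `i` would pass `rail_op_context(req, i)` instead of `req` itself, so no two simultaneous operations share a context.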
Hi @rauteric,
I see you removed the checks for whether the provider requires FI_CONTEXT and whether the provider supports RMA here:
aws-ofi-nccl/src/nccl_ofi_rdma.c, lines 5927 to 5931 in 1006b3f
Any specific reason behind this change?
Hi, this was mostly just a refactoring.
The previous code set hints->mode = FI_CONTEXT for both the RDMA and SENDRECV protocols (since the SENDRECV protocol supports it), so we needed some logic to check that the provider didn't enable it when using the RDMA protocol.
Since a recent refactoring, the hints->mode parameter is set separately for the two protocols. The RDMA protocol does not include FI_CONTEXT in hints->mode (which means it does not support it), so it no longer needs the extra check.
Similarly, the check for FI_RMA support from the provider is not needed, since it is requested as a capability bit in the hints:
aws-ofi-nccl/src/nccl_ofi_rdma.c, lines 5763 to 5764 in 1006b3f
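To make the reasoning above concrete, here is a simplified model in C of how hints can filter providers. It is only a sketch, not Libfabric or aws-ofi-nccl code: the `EX_` bit values, `ex_hints`, `ex_provider`, and `provider_matches` are all hypothetical stand-ins, and the real matching is done inside `fi_getinfo`. The model assumes that a provider is offered only if it has every requested capability and if every mode bit it requires was set by the application.

```c
#include <stdint.h>

/* Hypothetical bit values for illustration only; the real constants
 * come from <rdma/fabric.h> and have different values. */
#define EX_FI_CONTEXT (1ULL << 0)
#define EX_FI_RMA     (1ULL << 1)
#define EX_FI_MSG     (1ULL << 2)

/* Hypothetical, reduced view of what the application asks for. */
struct ex_hints {
    uint64_t caps; /* capabilities the application requires */
    uint64_t mode; /* mode bits the application is able to support */
};

/* Hypothetical, reduced view of what a provider offers and demands. */
struct ex_provider {
    uint64_t caps;          /* capabilities the provider supports */
    uint64_t required_mode; /* mode bits the provider requires */
};

/* Returns 1 if the provider would be offered for these hints, else 0. */
static int provider_matches(const struct ex_hints *h,
                            const struct ex_provider *p)
{
    /* Every requested capability must be available. */
    if ((h->caps & p->caps) != h->caps)
        return 0;
    /* Every mode bit the provider requires must be supported by the app. */
    if ((p->required_mode & h->mode) != p->required_mode)
        return 0;
    return 1;
}
```

Under this model, hints with `caps = EX_FI_RMA | EX_FI_MSG` and `mode = 0` never match a provider that requires `EX_FI_CONTEXT` or lacks `EX_FI_RMA`, which is why separate checks after the query become redundant.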