
Comments (9)

rauteric commented on July 29, 2024

This absence is specific to aws-ofi-nccl's RDMA protocol implementation. FI_CONTEXT support was omitted only for convenience in our first implementation and can be added in the future.

from aws-ofi-nccl.

eliekozah-cornelisnetworks commented on July 29, 2024

Hi @rauteric,

Thank you for the clarification. Our team is considering contributing a fix to introduce FI_CONTEXT support with RDMA and would love to get your thoughts on any implementation preferences or directions your team might have. This would help us tailor our contribution to fit into the project.


rauteric commented on July 29, 2024

Hi @eliekozah-cornelisnetworks,

Thank you for considering a contribution. In general, I think FI_CONTEXT support can be implemented similarly to how we implement it for the SENDRECV protocol (https://github.com/aws/aws-ofi-nccl/blob/master/src/nccl_ofi_sendrecv.c), by adding an fi_context member to the request data structure.

The main obstacle to supporting FI_CONTEXT is that the context we currently use can be shared between multiple simultaneous Libfabric operations. The one example I can find is the RDMA write operations for individual rails, which all use the same context (req):

rc = fi_writedata(comm_rail->local_ep, send_data->buff + xfer_info->offset,
                  xfer_info->msg_size, desc, send_data->wdata,
                  comm_rail->remote_addr,
                  send_data->remote_buff + xfer_info->offset,
                  send_data->remote_mr_key[rail_id], req);

To support FI_CONTEXT, these would need to be split either into separate fi_context objects inside the request structure or, preferably, into subrequests of the parent request structure.

There may be other examples of this in the current code, but that's the only one I can find at the moment.
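
The "separate fi_context objects inside the request structure" option might look roughly like the sketch below. It is not taken from aws-ofi-nccl: `rdma_req_sketch`, `rail_ctx_sketch`, `MAX_RAILS_SKETCH`, and the helper functions are hypothetical names, and `fi_context_standin` only mirrors the layout of Libfabric's `struct fi_context` so the example compiles without Libfabric headers. The idea is that each rail's write passes its own context pointer, and the completion handler recovers the parent request and the rail index from that pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for Libfabric's struct fi_context (provider scratch space).
 * Real code would include <rdma/fabric.h> and use struct fi_context. */
struct fi_context_standin {
    void *internal[4];
};

#define MAX_RAILS_SKETCH 4 /* hypothetical limit, for illustration only */

struct rdma_req_sketch;

/* One context per rail, with a back-pointer to the parent request. */
struct rail_ctx_sketch {
    struct fi_context_standin ctx;   /* must remain the first member */
    struct rdma_req_sketch *parent;
    int rail_id;
};

/* Hypothetical request: no context is shared between in-flight operations. */
struct rdma_req_sketch {
    int num_rails;
    int completed;
    struct rail_ctx_sketch rail_ctx[MAX_RAILS_SKETCH];
};

void req_init(struct rdma_req_sketch *req, int num_rails)
{
    assert(num_rails > 0 && num_rails <= MAX_RAILS_SKETCH);
    req->num_rails = num_rails;
    req->completed = 0;
    for (int i = 0; i < num_rails; i++) {
        req->rail_ctx[i].parent = req;
        req->rail_ctx[i].rail_id = i;
    }
}

/* Context pointer to pass as the last argument of the write on rail_id. */
void *req_rail_context(struct rdma_req_sketch *req, int rail_id)
{
    assert(rail_id >= 0 && rail_id < req->num_rails);
    return &req->rail_ctx[rail_id].ctx;
}

/* Handle one completion given its op_context. Stores the rail index in
 * *rail_id_out and returns 1 once every rail has completed, else 0. */
int req_handle_completion(void *op_context, int *rail_id_out)
{
    /* ctx is the first member, so the pointers are interconvertible. */
    struct rail_ctx_sketch *rc = (struct rail_ctx_sketch *)op_context;
    struct rdma_req_sketch *req = rc->parent;

    *rail_id_out = rc->rail_id;
    req->completed++;
    return req->completed == req->num_rails;
}
```

In this sketch the write on a given rail would pass `req_rail_context(req, rail_id)` where the current code passes `req`. The subrequest approach preferred above would follow the same pattern, with each subrequest owning one context and pointing back to its parent.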


eliekozah-cornelisnetworks commented on July 29, 2024

Hi @rauteric,

I see you removed the checks for whether the provider requires FI_CONTEXT and whether it supports RMA here:

struct fi_info *info_list;
/* Retrieve NIC info list from topology */
info_list = nccl_ofi_topo_next_info_list(&data_iter);

Any specific reason behind this change?


rauteric commented on July 29, 2024

Hi, this was mostly just a refactoring.

The previous code set hints->mode = FI_CONTEXT for both RDMA and SENDRECV protocols (since SENDRECV protocol supports it), so we needed some logic to check that the provider didn't enable it when using RDMA protocol.

Since a recent refactoring, the hints->mode parameter is set separately for the two protocols. The RDMA protocol does not include FI_CONTEXT in hints->mode (which means it does not support it), so it no longer needs the extra check.

Similarly, the check for FI_RMA support from the provider is not needed, since FI_RMA is requested as a capability bit in the hints:

/* Primary Capabilities */
hints->caps = FI_MSG | FI_RMA | FI_HMEM;
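
Why the hints alone make both explicit checks redundant can be illustrated with a small model of the matching rule, assuming the usual fi_getinfo behavior: a provider is returned only if every mode bit it requires is set in hints->mode and every capability requested in hints->caps is supported. Everything below is a simplified stand-in rather than Libfabric or aws-ofi-nccl code: the `SK_` constants use placeholder bit values (real code takes FI_CONTEXT, FI_MSG, FI_RMA, and FI_HMEM from <rdma/fabric.h>), and the structs and functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder bit values, for illustration only. */
#define SK_FI_CONTEXT (1ULL << 0)
#define SK_FI_MSG     (1ULL << 1)
#define SK_FI_RMA     (1ULL << 2)
#define SK_FI_HMEM    (1ULL << 3)

/* What the application declares: modes it can honor, caps it needs. */
struct hints_sketch {
    uint64_t mode;
    uint64_t caps;
};

/* What a provider declares: modes it requires, caps it offers. */
struct provider_sketch {
    uint64_t required_mode;
    uint64_t supported_caps;
};

/* Hints as described for the RDMA protocol: FI_CONTEXT is not in mode,
 * and FI_RMA is among the requested capabilities. */
struct hints_sketch rdma_hints_sketch(void)
{
    struct hints_sketch h;
    h.mode = 0;
    h.caps = SK_FI_MSG | SK_FI_RMA | SK_FI_HMEM;
    return h;
}

/* Returns 1 if the provider would be offered for these hints, else 0. */
int provider_matches_sketch(struct hints_sketch h, struct provider_sketch p)
{
    int modes_ok = (p.required_mode & ~h.mode) == 0;
    int caps_ok = (h.caps & ~p.supported_caps) == 0;
    return modes_ok && caps_ok;
}
```

Under this model, a provider that requires FI_CONTEXT or lacks FI_RMA never matches the RDMA hints, so there is nothing left for a later explicit check to reject.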

