AWS OFI NCCL

AWS OFI NCCL is a plug-in that enables EC2 developers to use libfabric as a network provider while running NVIDIA's NCCL-based applications.

Overview

Machine learning frameworks running on top of NVIDIA GPUs use a library called NCCL, which provides standard collective communication routines for an arbitrary number of GPUs installed across single or multiple nodes.

This project implements a plug-in that maps NCCL's connection-oriented transport APIs to libfabric's connection-less reliable interface. This allows NCCL applications to take advantage of libfabric's transport layer services, such as reliable message support and operating system bypass.

Getting Started

The best way to build the plugin is to start with the latest release package. The plugin developers highly discourage customers from building directly from the HEAD of a GitHub branch, as releases go through more extensive testing than the pre-commit testing on git branches. More information about installing the plugin from a released tarball can be found in INSTALL.md.

Version numbers that end in -aws have only been tested on Amazon Web Services Elastic Compute Cloud (EC2) instances and the Elastic Fabric Adapter (EFA) network transport. Customers using other networks may experience unexpected issues with these releases, but we welcome bug reports if that is the case.

Basic Requirements

The plugin is regularly tested on the following operating systems:

  • Amazon Linux 2
  • Ubuntu 20.04 LTS and 22.04 LTS

Other operating systems are likely to work; there is very little distribution-specific code in the plugin.

To build the plugin, you need to have Libfabric and HWLOC installed. If you want to run the included multi-node tests, you also need an MPI implementation installed. Each release of the plugin has a list of dependency versions in the top-level README.md file.

The plugin does not require NCCL to be pre-installed at build time, but an NCCL installation is required to use the plugin. As of NCCL 2.4.8, it is possible to use the same plugin build across multiple versions of NCCL (such as those installed per-package with Conda-like environments).

Most Libfabric providers should work with the plugin, possibly through a utility provider. The plugin generally requires reliable datagram endpoints (FI_EP_RDM) with tagged messaging (FI_TAGGED, FI_MSG). This is similar to the requirements of most MPI implementations and is a generally tested path in Libfabric. For GPUDirect RDMA support, the plugin also requires FI_HMEM support, as well as RDMA support.
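The provider requirements above amount to a capability check, which can be sketched in C as follows. This is a minimal, self-contained illustration and not the plugin's actual code: the `CAP_*` constants, `struct provider_caps`, and both functions are hypothetical names introduced here, and the bit values are placeholders rather than the real `FI_*` values. Real code would query providers through Libfabric (for example with `fi_getinfo()`) and use the constants from its headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for Libfabric capability bits.
 * These values are placeholders, NOT the real FI_* constants. */
enum {
    CAP_MSG    = 1u << 0,  /* stands in for FI_MSG */
    CAP_TAGGED = 1u << 1,  /* stands in for FI_TAGGED */
    CAP_RDMA   = 1u << 2,  /* stands in for RDMA support */
    CAP_HMEM   = 1u << 3   /* stands in for FI_HMEM */
};

/* Hypothetical summary of what one provider offers. */
struct provider_caps {
    bool     rdm_endpoint;  /* offers reliable datagram endpoints (FI_EP_RDM) */
    uint64_t caps;          /* bitwise OR of CAP_* values */
};

/* True when every bit in `want` is also set in `have`. */
static bool has_all(uint64_t have, uint64_t want)
{
    return (have & want) == want;
}

/* Baseline requirement: RDM endpoints with tagged messaging. */
bool provider_usable(const struct provider_caps *p)
{
    assert(p != NULL);
    return p->rdm_endpoint && has_all(p->caps, CAP_MSG | CAP_TAGGED);
}

/* GPUDirect RDMA additionally needs HMEM and RDMA support. */
bool provider_supports_gpudirect(const struct provider_caps *p)
{
    return provider_usable(p) && has_all(p->caps, CAP_HMEM | CAP_RDMA);
}
```

Under these assumptions, a provider that reports RDM endpoints with `CAP_MSG | CAP_TAGGED` passes `provider_usable()` but fails `provider_supports_gpudirect()` until it also reports `CAP_HMEM` and `CAP_RDMA`.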

Getting Help

If you have any issues in building or using the package or if you think you may have found a bug, please open an issue.

Contributing

Reporting issues and sending pull requests are always welcome. To learn how you can contribute, please look at our contributing guidelines.

License

This library is licensed under the Apache 2.0 License.
