Giter VIP home page Giter VIP logo

deepops's Introduction

DeepOps

GPU infrastructure and automation tools

Overview

The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as NVIDIA DGX Systems). DeepOps can also be adapted or used in a modular fashion to match site-specific cluster needs. For example:

  • An on-prem, air-gapped data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
  • An existing cluster running Kubernetes where DeepOps scripts are used to deploy Kubeflow and connect NFS storage
  • An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm, Kubernetes, or a hybrid of both
  • A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime

Releases

Latest release: DeepOps 19.03 Release

It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally functional but may change significantly between releases.

Getting Started

For detailed help or guidance, read through our Getting Started Guide or pick one of the deployment options documented below.

Deployment Options

Kubernetes

Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications.

Consult the Kubernetes Guide for instructions on building a GPU-enabled Kubernetes cluster using DeepOps.

For more information on Kubernetes in general, refer to the official Kubernetes docs.

Slurm

Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

Consult the Slurm Guide for instructions on building a GPU-enabled Slurm cluster using DeepOps.

For more information on Slurm in general, refer to the official Slurm docs.

DGX POD Hybrid Cluster

A hybrid cluster with both Kubernetes and Slurm can also be deployed. This is recommended for DGX POD and other setups that wish to make maximal use of the cluster.

Consult the DGX POD Guide for step-by-step instructions on building a GPU-enabled hybrid cluster using DeepOps.

Virtual

To try DeepOps before deploying it on an actual cluster, a virtualized version of DeepOps may be deployed on a single node using Vagrant. This can be used for testing, adding new features, or configuring DeepOps to meet deployment-specific needs.

Consult the Virtual Guide to build a GPU-enabled virtual cluster with DeepOps.

Updating DeepOps

To update from a previous version of DeepOps to a newer release, please consult the Update Guide.

Copyright and License

This project is released under the BSD 3-clause license.

Issues

NVIDIA DGX customers should file an NVES ticket via NVIDIA Enterprise Services.

Otherwise, bugs and feature requests can be made by filing a GitHub Issue.

Contributing

To contribute, please issue a pull request against the master branch from a local fork.

A signed copy of the Contributor License Agreement needs to be provided to [email protected] before any change can be accepted.

deepops's People

Contributors

dholt avatar ajdecon avatar michael-balint avatar lukeyeager avatar supertetelman avatar satindern avatar ryanolson avatar scottesandiego avatar tvincereed avatar hightoxicity avatar mfoco avatar c1230a avatar weeellz avatar nvhans avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.