ElasticFlow-artifact

We provide the artifact for the ASPLOS 2023 paper "ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning", including:

  • The main implementation of ElasticFlow.
  • Cluster simulation scripts (Sec 6.3 & 6.4 & 6.5), which get the main results of the paper.
  • Testbed experiment scripts (Sec 6.2 & 6.6).
  • Figure plotting scripts.

Simulation Experiments

General Simulation Experiments

Please see ElasticFlow/README.md for more details.

Pollux simulation

Please see pollux/pollux_simulator/README.md for more details.

Testbed Experiments

Note: Because the execution scripts for the testbed experiments are tightly coupled to our internal testbed platform, we only demonstrate the functionality and provide the reproduction steps for the hardware we use. Please adapt them to your platform if you would like to run the testbed experiments.

The testbed experiments require 16 nodes, each with 8 A100 GPUs, 96 CPU cores, 900 GB RAM, and eight NVIDIA Mellanox HDR InfiniBand HCAs. You may use the Azure Standard_ND96asr_A100 VMs for reproduction.
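Before adapting the scripts, it may help to check a candidate cluster against the numbers above. The following is a minimal sketch, not part of the artifact: the `Node` record, the constant names, and the helper functions are hypothetical, and the inventory has to be filled in by hand or from your own tooling.

```python
from dataclasses import dataclass

# Requirements as stated in this README. All names here are hypothetical
# and are not taken from the ElasticFlow code base.
REQUIRED_NODES = 16
REQUIRED_GPUS_PER_NODE = 8
REQUIRED_CPU_CORES = 96
REQUIRED_RAM_GB = 900
REQUIRED_IB_HCAS = 8


@dataclass
class Node:
    """Hand-written description of one machine in the candidate cluster."""
    name: str
    gpus: int
    cpu_cores: int
    ram_gb: int
    ib_hcas: int


def node_shortfalls(node):
    """Return human-readable reasons why a node misses the requirements."""
    checks = [
        ("GPUs", node.gpus, REQUIRED_GPUS_PER_NODE),
        ("CPU cores", node.cpu_cores, REQUIRED_CPU_CORES),
        ("GB RAM", node.ram_gb, REQUIRED_RAM_GB),
        ("InfiniBand HCAs", node.ib_hcas, REQUIRED_IB_HCAS),
    ]
    return [
        f"{node.name}: has {have} {label}, needs {need}"
        for label, have, need in checks
        if have < need
    ]


def total_gpus(nodes):
    """Sum the GPUs over all nodes in the inventory."""
    return sum(node.gpus for node in nodes)


def check_cluster(nodes):
    """Return (ok, report); report lists every shortfall that was found."""
    report = []
    if len(nodes) < REQUIRED_NODES:
        report.append(f"cluster: has {len(nodes)} nodes, needs {REQUIRED_NODES}")
    for node in nodes:
        report.extend(node_shortfalls(node))
    return (not report, report)


if __name__ == "__main__":
    # An inventory that matches the stated testbed exactly.
    cluster = [Node(f"node{i}", 8, 96, 900, 8) for i in range(16)]
    ok, report = check_cluster(cluster)
    print("meets requirements:", ok)
    print("total GPUs:", total_gpus(cluster))  # 16 nodes x 8 GPUs = 128
    for line in report:
        print(line)
```

The check only compares counts; it does not verify the GPU model or the InfiniBand generation, which you would still need to confirm separately.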

General Testbed Experiments

Please see ElasticFlow/README.md for more details.

Pollux Testbed Experiments

As the Pollux baseline is implemented on k8s, we do not integrate Pollux into the ElasticFlow system for comparison. Instead, we use the open-source artifact from the Pollux repo for the testbed experiments.

Please see pollux/pollux_testbed/README.md for more details.

Plotting Figures

Please see <repo>/plot_figure/README.md for more details.

Citation

If you use this code in your research, please cite our paper.

@inproceedings{GuZZXHCYHJL23,
  author       = {Diandian Gu and
                  Yihao Zhao and
                  Yinmin Zhong and
                  Yifan Xiong and
                  Zhenhua Han and
                  Peng Cheng and
                  Fan Yang and
                  Gang Huang and
                  Xin Jin and
                  Xuanzhe Liu},
  title        = {ElasticFlow: An Elastic Serverless Training Platform for Distributed
                  Deep Learning},
  booktitle    = {Proceedings of the 28th {ACM} International Conference on Architectural
                  Support for Programming Languages and Operating Systems, Volume 2,
                  {ASPLOS} 2023},
  pages        = {266--280},
  year         = {2023},
  doi          = {10.1145/3575693.3575721}
}

elasticflow's Issues

How did you obtain ElasticFlow\scheduler\throughputs_T4?

Hi!
I would like to know how you obtained the T4 data: did you run Pollux's code, or was the data provided directly (I can't seem to find it)?
As far as I can tell, the existing T4 data cannot be used for ElasticFlow as-is, because it does not include the number of iterations of each job's model under different global batch sizes and different numbers of GPUs.
