awesome-dl-scheduling-papers's Introduction

Awesome-DL-Scheduling-Papers

🔥 A curated list of DL cluster scheduling papers.

Please feel free to submit a pull request or open an issue to add papers.

Schedulers for DL Training

Scheduler Year Series Paper Objective Heter. Elastic AutoML Code
Acme 2024 NSDI Paper - - - Code
Cassini 2024 NSDI Paper ♠♣♥ - - -
Sia 2023 SOSP Paper ♠♣♥ Code
EasyScale 2023 SC Paper ♠♣ - Code
Hydro 2023 OSDI Paper ♠▲ Code
Shockwave 2023 NSDI Paper ♠♣♥ - - Code
ModelKeeper 2023 NSDI Paper ♦♣ - - Code
Lyra 2023 EuroSys Paper ♠♣ - -
SiloD 2023 EuroSys Paper ♠♣♥ ✔* - - -
FGD 2023 ATC Paper ♠♣ - - - Code
ElasticFlow 2023 ASPLOS Paper ♣✿ - - Code
Lucid 2023 ASPLOS Paper ♠♣ - - - Code
PowerFlow 2023 arxiv Paper ♦♣ - - -
EDL 2022 TPDS Paper ♠♣ - - -
AOnline 2022 TCC Paper ♠♣ - - -
Titan 2022 SoCC Paper ♠♣ - - - -
Muri 2022 SIGCOMM Paper ♠♣ ✔* - - Code
Synergy 2022 OSDI Paper - - - Code
Ali-MLaaS 2022 NSDI Paper ♠♣ - - - Code
GADGET 2022 INFOCOM Paper ♠♣ - - Code
CloudBrain 2022 ICCD Paper - - - Code
Aryl 2022 arxiv Paper ♠♣ - -
Singularity 2022 arxiv Paper ♠♣♦ - - -
$DL^2$ 2021 TPDS Paper - - Code
Astraea 2021 TPDS Paper - - - Code
Horus 2021 TPDS Paper ♠♣ - - - -
Liquid 2021 TPDS Paper - - - Code
POP 2021 SOSP Paper ♥♣ - - Code
Chronus 2021 SoCC Paper - - - Code
SEER 2021 SoCC Paper - -
Helios 2021 SC Paper ♣♦ - - - Code
ONES 2021 SC Paper ♠♣ - - Code
Pollux 2021 OSDI Paper ♠♣♥ - Code
AFS 2021 NSDI Paper ♠♣ - - -
SMD 2021 INFOCOM Paper - - - -
ANDREAS 2021 FCloud Paper - - - -
RubberBand 2021 EuroSys Paper - -
Hermes 2021 Electronics Paper - - -
Jigsaw 2021 DistributedML Paper - - - -
DynamoML 2021 CLOSER Paper ♠♣ - - -
GENIE 2020 TPDS Paper - - -
Parrot 2020 TCC Paper - - - -
Non-Intrusive 2020 SC Paper ♠♣ - - -
Antman 2020 OSDI Paper ♠♣ - - Code
Gavel 2020 OSDI Paper ♣♥ - - Code
HiveD 2020 OSDI Paper - - - Code
Themis 2020 NSDI Paper - - - -
Salus 2020 MLSys Paper ♠♣ - - - Code
Vaibhav et al. 2020 MASCOTS Paper ♠♣ - - -
SPIN 2020 INFOCOM Paper - - - -
E-LAS 2020 ICPP Paper - - - -
CODA 2020 ICDCS Paper ✔* - - -
Elan 2020 ICDCS Paper ♠♣ - - -
Yeung 2020 HotCloud Paper - - - -
$Gandiva_{fair}$ 2020 EuroSys Paper ♥♣ - - -
MLCloudPrice 2020 DISPA Paper ♣♦ - - - Code
MLFS 2020 CoNext Paper ♣✿ - - - Code
MARBLE 2020 CCGRID Paper ♠♣ - - -
Ada-SRSF 2020 arxiv Paper ✔* - - -
Co-scheML 2020 ACSOS Paper - - - -
HyperSched 2019 SoCC Paper ✿ ▲ - -
Tiresias 2019 NSDI Paper - - - Code
FfDL 2019 Middleware Paper - - - Code
JPAS 2019 JNCA Paper ♣▲ - - -
Harmony 2019 INFOCOM Paper - - - -
Cynthia 2019 ICPP Paper - - -
Jahani 2019 ICCCS Paper - -
$Sched^2$ 2019 GLOBECOM Paper - - - -
Dragon 2019 CLOSER Paper ♠♣ - - -
$FC^2$ 2019 CC Paper ✔* - -
Philly 2019 ATC Paper - - - Code
Gandiva 2018 OSDI Paper ♠♣ - -
OASiS 2018 INFOCOM Paper ♠♣ - - -
Optimus 2018 EuroSys Paper - - Code
Dorm 2017 SMARTCOMP Paper - - - -
Topology-Aware 2017 SC Paper - - - Code
HyperDrive 2017 Middleware Paper ♣▲ - - -

Symbols of Training Schedulers:

JCT Utilization Cost Fairness DDL Accuracy

Schedulers for DL Inference

Scheduler Year Series Paper Objective Batch Share Cloud Source Code
SpotServe 2024 ASPLOS Paper ♦♥ - Code
DeltaZip 2023 arxiv Paper ♥♠ - - Code
MOSEL 2023 arxiv Paper ♦♠ - - -
Punica 2023 arxiv Paper ♥♠ - ✔* - Code
S-LoRA 2023 arxiv Paper ♥♠ - ✔* - Code
Symphony 2023 arxiv Paper ✿♠♦ - - -
DeepPlan 2023 EuroSys Paper ♦♠ - - Code
Tabi 2023 EuroSys Paper ♦♣ - - -
Kairos 2023 HPDC Paper ♦♥♠ - Code
Shepherd 2023 NSDI Paper ✿♠♦ - - -
AlpaServe 2023 OSDI Paper ♦♠ - Code
Clover 2023 SC Paper - Code
iGniter 2023 TPDS Paper ✿♥ Code
Gpulet 2022 ATC Paper ✿♠♥ - Code
Cocktail 2022 NSDI Paper ♣♦♥ - - Code
INFaaS 2021 ATC Paper ♦♥♠ - Code
MIG-SERVING 2021 CoRR Paper ♦♥ - -
Mendoza et al. 2021 EuroMLSys Paper - - -
Abacus 2021 SC Paper ♦♠ - - Code
Morphling 2021 SoCC Paper ♥♠ Code
Irina 2020 APNet Paper ♦♠✿ - -
DyBatch 2020 CCGrid Paper ♦♠ - -
CMS 2020 Future Internet Paper ♣✿ - - - -
PERSEUS 2020 IC2E Paper ♦♥♠ - Code
AutoDeep 2020 Infocom Paper ♦♥♠ - -
Clockwork 2020 OSDI Paper ♦♠ - - Code
GSLICE 2020 SoCC Paper ♠✿ - -
Inferline 2020 SoCC Paper ♦♥ - Code
MArk 2019 ATC Paper ♦♥ - Code
TrIMS 2019 CLOUD Paper ♦♠✿ Code
Kube-Knots 2019 CLUSTER Paper ♦✿ - -
Gilman et al. 2019 DIDL Paper ♦♠ - - -
Nanily 2019 HPCC Paper ♦♠ - - -
Ebird 2019 ICCD Paper ♦♠✿ - Code
Tolerance Tiers 2019 ISPASS Paper ♣♦♥ - - -
RRL 2019 SC Paper - Code
ParM 2019 SOSP Paper - - Code
HiveMind 2018 NIPS Paper - -
Space-Time 2018 NIPS Paper ♠✿ - -
Ease.ml 2018 VLDB Paper - - - Code
Rafiki 2018 VLDB Paper ♣♦ - - Code
Clipper 2017 NSDI Paper ♣♦♠ - - Code

Symbols of Inference Schedulers:

Accuracy Throughput Latency Cost Utilization

Glossary of Terms

Terminology Definition
JCT Job Completion Time (Job Finish Time - Job Submission Time)
Fairness A metric to assess whether resources are fairly shared among users or jobs
QoS Quality of Service
DDL Deadline, the point in time by which a DL job must be completed
SLO Service Level Objective
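
The JCT and DDL definitions above can be illustrated with a minimal Python sketch. The `Job` record, its field names, and the sample timestamps are hypothetical and chosen only for illustration; they are not taken from any of the listed schedulers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """Hypothetical job record; times are seconds since some common origin."""
    name: str
    submit_time: float
    finish_time: float
    deadline: Optional[float] = None  # DDL; None means the job has no deadline


def jct(job: Job) -> float:
    """Job Completion Time = Job Finish Time - Job Submission Time."""
    return job.finish_time - job.submit_time


def average_jct(jobs: List[Job]) -> float:
    """Mean JCT over a non-empty list of jobs."""
    return sum(jct(j) for j in jobs) / len(jobs)


def meets_deadline(job: Job) -> bool:
    """A job meets its DDL if it finishes at or before the deadline."""
    return job.deadline is None or job.finish_time <= job.deadline


# Illustrative data only.
jobs = [
    Job("a", submit_time=0.0, finish_time=120.0, deadline=150.0),
    Job("b", submit_time=30.0, finish_time=90.0, deadline=80.0),
    Job("c", submit_time=60.0, finish_time=300.0),
]

print([jct(j) for j in jobs])             # per-job JCT
print(average_jct(jobs))                  # mean JCT
print([meets_deadline(j) for j in jobs])  # deadline satisfaction per job
```

Note that JCT as defined here includes queuing delay, since it is measured from submission rather than from the moment the job starts running.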

awesome-dl-scheduling-papers's People

Contributors

gaow0007, smile-luobin, tonyhao96

awesome-dl-scheduling-papers's Issues

TPDS'22

ASTRAEA: A Fair Deep Learning Scheduler for Multi-tenant GPU Clusters
Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems
Elastic Parameter Server: Accelerating ML Training With Scalable Resource Scheduling

EuroSys'22

Out-Of-Order BackProp: An Effective Scheduling Technique for Deep Learning

OSDI'22

Achieving μs-scale Preemption for Concurrent GPU-accelerated DNN Inferences
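This paper proposes a hard-real-time, highly concurrent GPU scheduling technique for DNN inference tasks. It supports preemption of non-real-time tasks on the scale of hundreds of microseconds and improves overall system throughput by 1.1x to 4.3x.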

ASPLOS'22

VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling

TC'22

Online Scheduling of Distributed Machine Learning Jobs for Incentivizing Sharing in Multi-Tenant Systems

Others

TNSE'22: Toward Efficient Online Scheduling for Distributed Machine Learning Systems

arXiv'22

Automated Runtime-Aware Scheduling for Multi-Tenant DNN Inference on GPU

INFOCOM'22

Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration
AutoByte: Automatic Configuration for Optimal Communication Scheduling in DNN Training
