Resource Scheduler
I am actively maintaining this list.
Scheduling for DL Training Workloads
Blox: A Modular Toolkit for Deep Learning Schedulers (EuroSys 2024) [arXiv] [Code]
UW-Madison & MSR
Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach (SC 2023) [Personal Notes] [Paper] [Code]
UMacau & SIAT, CAS
IADeep: a cluster scheduler that co-locates DL training tasks on shared GPUs
Tune training configurations (e.g., batch size) across all co-located tasks; choose appropriate tasks to multiplex on a GPU device; consider PCIe bandwidth
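A minimal sketch of the co-location decision this entry describes, assuming a toy interference model over SM utilization and PCIe bandwidth (all task names, demands, and capacities below are hypothetical, not from the paper):

```python
GPU_SM_CAPACITY = 1.0        # normalized streaming-multiprocessor budget of one GPU
PCIE_CAPACITY_GBPS = 16.0    # host-to-device bandwidth of the shared PCIe link

running = {"sm": 0.55, "pcie": 6.0}   # task already placed on the GPU

# Candidate partner tasks: per batch size, (normalized SM demand, PCIe demand GB/s).
candidates = {
    "bert": {16: (0.50, 3.0), 32: (0.70, 5.0)},
    "lstm": {64: (0.30, 2.0), 128: (0.45, 4.0)},
}

def predicted_slowdown(a, b):
    # Toy interference model: contention appears once summed demand exceeds
    # capacity on either the SM or the PCIe dimension.
    sm = max(1.0, (a["sm"] + b["sm"]) / GPU_SM_CAPACITY)
    pcie = max(1.0, (a["pcie"] + b["pcie"]) / PCIE_CAPACITY_GBPS)
    return max(sm, pcie)

best_task, best_bs = min(
    ((name, bs) for name, cfgs in candidates.items() for bs in cfgs),
    key=lambda nb: predicted_slowdown(
        running, dict(zip(("sm", "pcie"), candidates[nb[0]][nb[1]]))),
)
print("co-locate:", best_task, "with batch size", best_bs)
```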
Lyra: Elastic Scheduling for Deep Learning Clusters (EuroSys 2023) [Personal Notes] [Paper] [arXiv]
ByteDance & CityU & CUHK
Loan idle inference GPU servers to elastic training jobs.
Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning (NSDI 2023) [Personal Notes] [Paper] [Code]
UW-Madison & UT-Austin
Handle jobs with elastic (dynamic) resource requirements; extend market theory from static to dynamic settings to co-optimize efficiency and fairness.
Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs (ASPLOS 2023) [Personal Notes] [Paper] [Code]
NTU & Shanghai AI Lab & SenseTime
Scheduling interpretability
Multi-Resource Interleaving for Deep Learning Training (SIGCOMM 2022) [Personal Notes] [Paper] [Code]
PKU & ByteDance
Muri: Pack jobs along multiple resource types in the time dimension
Integrate with PyTorch
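A back-of-the-envelope sketch of the interleaving intuition: packing a GPU-bound job with a CPU/I-O-bound job lets each resource serve both jobs once per cycle, so the idealized steady-state cycle time is bounded by the busiest resource rather than by the sum of both iteration times (the resource set and phase durations below are made up):

```python
RESOURCES = ("storage", "cpu", "gpu", "network")

job_x = {"storage": 0.2, "cpu": 0.5, "gpu": 1.0, "network": 0.3}  # GPU-bound (s/iter)
job_y = {"storage": 0.9, "cpu": 1.1, "gpu": 0.2, "network": 0.1}  # CPU/IO-bound (s/iter)

# Time-sliced sharing: one iteration of each job, run back to back.
time_sliced = sum(job_x.values()) + sum(job_y.values())

# Ideal interleaving: each resource serves both jobs once per cycle, so the
# cycle is limited by the most contended resource.
interleaved = max(job_x[r] + job_y[r] for r in RESOURCES)

print(f"time-sliced: {time_sliced:.1f}s  interleaved lower bound: {interleaved:.1f}s")
```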
Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads (arXiv 2202.07848) [Personal Notes] [Paper]
Microsoft
Live GPU job migration
Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (OSDI 2022) [Personal Notes] [Paper] [Code]
MSR & UT-Austin & VMware Research
Consider the allocation of CPU and memory resources.
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 2021) [Personal Notes] [Paper] [Code]
Petuum & CMU
Best Paper Award
Co-adaptively allocates resources (number of GPUs) and tunes the hyperparameters (batch size and learning rate) for all DL training jobs.
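A rough sketch of the goodput idea (system throughput times statistical efficiency), assuming a made-up throughput model and a gradient-noise-scale style efficiency term; this is an illustration, not Pollux's actual implementation:

```python
def statistical_efficiency(batch_size, base_batch_size=128, grad_noise_scale=1000.0):
    # Diminishing returns per example as the global batch grows, in the spirit
    # of the gradient-noise-scale analysis Pollux builds on (constants made up).
    return (grad_noise_scale + base_batch_size) / (grad_noise_scale + batch_size)

def throughput(num_gpus, batch_size):
    # Hypothetical measured examples/sec; a real scheduler fits this online.
    per_gpu = 500.0 * min(1.0, batch_size / (128 * num_gpus))
    return per_gpu * num_gpus

def goodput(num_gpus, batch_size):
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

candidates = [(g, b) for g in (1, 2, 4, 8) for b in (128, 256, 512, 1024)]
best = max(candidates, key=lambda c: goodput(*c))
print("best (num_gpus, batch_size):", best, "goodput:", round(goodput(*best), 1))
```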
Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2021) [Paper]
PKU & NTU & SenseTime
Long-term GPU-time fairness
HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees (OSDI 2020) [Personal Notes] [Paper] [Code]
MSRA
Virtual private clusters; resource isolation and management for multi-tenant clusters.
Themis: Fair and Efficient GPU Cluster Scheduling (EuroSys 2020) [Paper]
UW-Madison & MSR
Long-term fairness
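Themis frames long-term fairness around a finish-time fairness metric; a minimal sketch of that idea, with hypothetical jobs, allocations, and runtime estimates:

```python
jobs = {
    # name: (remaining GPU-hours of work, GPUs currently allocated, fair-share GPUs)
    "job_a": (100.0, 2, 4),
    "job_b": ( 40.0, 4, 4),
    "job_c": ( 80.0, 1, 4),
}

def finish_time_fairness(remaining, allocated, fair_share):
    t_shared = remaining / max(allocated, 1e-9)   # hours at the current allocation
    t_ideal = remaining / fair_share              # hours at an exclusive 1/N share
    return t_shared / t_ideal                     # rho > 1 means worse than fair

rho = {name: finish_time_fairness(*v) for name, v in jobs.items()}
most_behind = max(rho, key=rho.get)               # offer freed GPUs to this job
print(rho, "->", most_behind)
```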
AlloX: Compute Allocation in Hybrid Clusters (EuroSys 2020) [Paper] [Code]
Stony Brook University & SUNY Korea & UMich
CPU-GPU hybrid clusters; min-cost bipartite matching.
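A minimal sketch of the min-cost bipartite matching step, assuming hypothetical per-job runtime estimates on CPU vs. GPU slots (a simplification of the paper's formulation):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = jobs, columns = compute slots (2 CPU slots, then 2 GPU slots);
# entry [i, j] = estimated completion time (minutes) of job i on slot j.
est_runtime = np.array([
    [30.0, 30.0,  5.0,  5.0],   # job 0: much faster on GPU
    [12.0, 12.0, 10.0, 10.0],   # job 1: barely benefits from GPU
    [40.0, 40.0,  8.0,  8.0],   # job 2: much faster on GPU
    [ 9.0,  9.0,  7.0,  7.0],   # job 3: small speedup on GPU
])

job_idx, slot_idx = linear_sum_assignment(est_runtime)  # min-cost perfect matching
for j, s in zip(job_idx, slot_idx):
    device = "CPU" if s < 2 else "GPU"
    print(f"job {j} -> {device} slot {s} (est. {est_runtime[j, s]:.0f} min)")
```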
Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (EuroSys 2020) [Paper]
MSR India
Achieve efficiency and fairness despite cluster heterogeneity
Scheduling for General ML Training Workloads
SLAQ: Quality-Driven Scheduling for Distributed Machine Learning (SoCC 2017) [Personal Notes] [Paper]
Princeton
Fine-grained job-level scheduler
Leverage the iterative nature of general ML training algorithms
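A tiny sketch of the quality-driven allocation loop: each free worker goes to the job with the largest predicted loss reduction. The predictor below is a made-up diminishing-returns stand-in; the paper fits jobs' recent loss curves online to make this prediction.

```python
def marginal_quality_gain(job):
    # Predicted loss reduction from one more worker (hypothetical model).
    return job["base_gain"] / (1 + job["workers"])

jobs = [
    {"name": "lr",  "workers": 4, "base_gain": 0.02},
    {"name": "svm", "workers": 2, "base_gain": 0.10},
    {"name": "mlp", "workers": 8, "base_gain": 0.01},
]

for _ in range(5):                               # hand out 5 free workers
    best = max(jobs, key=marginal_quality_gain)  # largest predicted loss drop
    best["workers"] += 1

print({j["name"]: j["workers"] for j in jobs})
```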
Trace Analysis
Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019) [Paper]
Alibaba PAI
Survey
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (arXiv 2205.11913) [Paper] [Paper List]
NTU & PKU & SenseTime
Acronyms
DL: Deep Learning
ML: Machine Learning
Last updated