Resource Scheduler

I am actively maintaining this list.

Scheduling for DL Training Workloads

  • Blox: A Modular Toolkit for Deep Learning Schedulers (EuroSys 2024) [arXiv] [Code]

    • UW-Madison & MSR

  • Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach (SC 2023) [Personal Notes] [Paper] [Code]

    • UMacau & SIAT, CAS

    • IADeep: a cluster scheduler to co-locate DL training tasks

    • Tune training configurations (e.g., batch size) across all co-located tasks; choose appropriate tasks to multiplex on a GPU device; consider PCIe bandwidth

  • Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling (SOSP 2023) [Paper]

    • CMU & Cornell & Petuum Inc.

  • Lyra: Elastic Scheduling for Deep Learning Clusters (EuroSys 2023) [Personal Notes] [Paper] [arXiv]

    • ByteDance & CityU & CUHK

    • Loan idle inference GPU servers to elastic training jobs.

  • Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning (NSDI 2023) [Personal Notes] [Paper] [Code]

    • UW-Madison & UT-Austin

    • Elastic resource requirements; extend market theory.

  • Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs (ASPLOS 2023) [Personal Notes] [Paper] [Code]

    • NTU & Shanghai AI Lab & SenseTime

    • Scheduling interpretability

  • Multi-Resource Interleaving for Deep Learning Training (SIGCOMM 2022) [Personal Notes] [Paper] [Code]

    • PKU & ByteDance

    • Muri: Pack jobs along multiple resource types in the time dimension (see the sketch below this entry)

    • Integrate with PyTorch
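
Muri packs jobs whose iterations stress different resources (GPU, CPU, network, storage) at different times, so their stages can be interleaved on the same devices. The sketch below estimates how well a candidate group would interleave; the stage profiles and the efficiency formula are simplified illustrations, not Muri's actual grouping algorithm.

```python
# Hypothetical per-iteration stage profiles; a real scheduler would profile
# these online. RESOURCES and interleave_efficiency are illustrative names.

RESOURCES = ("gpu", "cpu", "network", "storage")

def interleave_efficiency(jobs):
    """jobs: list of dicts mapping resource -> per-iteration stage time (s).

    When k jobs are interleaved on the same devices, each resource serves the
    k stages back-to-back, so one interleaved round is bounded below by the
    busiest resource and by the slowest single job. Efficiency compares that
    bound against running the jobs with no overlap at all.
    """
    k = len(jobs)
    solo_total = sum(sum(job[r] for r in RESOURCES) for job in jobs)
    interleaved_round = max(
        max(sum(job[r] for job in jobs) for r in RESOURCES),  # busiest resource
        max(sum(job[r] for r in RESOURCES) for job in jobs),  # slowest single job
    )
    return solo_total / (k * interleaved_round)

# A GPU-heavy job and a CPU/IO-heavy job interleave almost perfectly ...
a = {"gpu": 0.80, "cpu": 0.10, "network": 0.05, "storage": 0.05}
b = {"gpu": 0.10, "cpu": 0.50, "network": 0.20, "storage": 0.20}
print(interleave_efficiency([a, b]))   # ~1.0

# ... while two GPU-bound jobs mostly queue behind each other on the GPU.
print(interleave_efficiency([a, a]))   # ~0.63
```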

  • Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads (arXiv 2202.07848) [Personal Notes] [Paper]

    • Microsoft

    • Live GPU job migration

  • Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (OSDI 2022) [Personal Notes] [Paper] [Code]

    • MSR & UT-Austin & VMware Research

    • Consider the allocation of CPU and memory resources.

  • Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 2021) [Personal Notes] [Paper] [Code]

    • Petuum & CMU

    • Best Paper Award

    • Co-adaptively allocates resources (number of GPUs) and tunes the hyperparameters (batch size and learning rate) for all DL training jobs.
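
Pollux's key metric is goodput: system throughput (examples/s at a given GPU count and batch size) multiplied by statistical efficiency (how much each example contributes to training progress at that batch size). Below is a minimal sketch of the co-adaptation idea; the throughput and efficiency models, their constants, and the candidate batch sizes are illustrative assumptions, not the paper's fitted models.

```python
# Hypothetical models; Pollux fits both online from metrics it observes
# during training.

def throughput(num_gpus: int, batch_size: int) -> float:
    """Examples/s for a data-parallel job: per-step time grows with the
    per-GPU batch (compute) and with the GPU count (gradient sync)."""
    local_bs = batch_size / num_gpus
    step_time = 0.05 + local_bs / 2000.0 + 0.01 * num_gpus
    return batch_size / step_time

def statistical_efficiency(batch_size: int, gradient_noise_scale: float = 1024.0) -> float:
    """Progress per example relative to a small batch, in the spirit of the
    gradient-noise-scale argument: very large batches waste examples."""
    return 1.0 / (1.0 + batch_size / gradient_noise_scale)

def goodput(num_gpus: int, batch_size: int) -> float:
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

def co_adapt(num_gpus: int, candidates=(128, 256, 512, 1024, 2048, 4096)) -> int:
    """For a proposed GPU allocation, re-tune the batch size for goodput
    rather than raw throughput (the scheduler searches allocations; each job
    re-tunes its own knobs)."""
    return max(candidates, key=lambda bs: goodput(num_gpus, bs))

# The chosen batch size grows with the allocation, but only while the
# statistical-efficiency penalty is worth paying.
for g in (1, 4, 16):
    bs = co_adapt(g)
    print(f"{g:>2} GPUs -> batch size {bs:>4}, goodput ~{goodput(g, bs):,.0f} examples/s")
```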

  • MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers (SC 2021) [Paper] [Code]

    • UC Riverside & Pacific Northwest National Lab & USydney

    • Consider multi-GPU accelerator topologies such as single/double NVLink.

  • Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2021) [Paper]

    • PKU & NTU & SenseTime

    • Long-term GPU-time fairness

  • AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 2020) [Paper] [Code]

    • Alibaba

    • Co-locate resource-guarantee and best-effort jobs.

  • HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees (OSDI 2020) [Personal Notes] [Paper] [Code]

    • MSRA

    • Virtual private clusters; resource isolation and management for multi-tenant clusters.

  • Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (OSDI 2020) [Paper] [Code]

    • MSR & Stanford

    • Gavel: Consider performance heterogeneity across multiple accelerator types.

  • Themis: Fair and Efficient GPU Cluster Scheduling (EuroSys 2020) [Paper]

    • UW-Madison & MSR

    • Long-term fairness

  • AlloX: Compute Allocation in Hybrid Clusters (EuroSys 2020) [Paper] [Code]

    • Stony Brook University & SUNY Korea & UMich

    • CPU-GPU hybrid clusters; min-cost bipartite matching.
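
AlloX estimates each job's processing time on CPU vs. GPU and solves a min-cost bipartite matching between jobs and (device, queue-position) slots to minimize average completion time. The sketch below shows that matching step with scipy's linear_sum_assignment; the processing-time estimates are made up and the cost construction is a simplification of the paper's formulation.

```python
# A job placed k-th from the end of a device's queue delays itself and the
# k-1 jobs behind it, so its contribution to the sum of completion times is
# k * (its processing time on that device).
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical estimates: seconds to finish each job on a CPU node vs. a GPU.
proc_time = {
    "job0": {"cpu": 300.0, "gpu": 40.0},
    "job1": {"cpu": 120.0, "gpu": 90.0},
    "job2": {"cpu": 60.0,  "gpu": 50.0},
    "job3": {"cpu": 500.0, "gpu": 45.0},
}
devices = ["cpu", "gpu"]
jobs = list(proc_time)
positions = range(1, len(jobs) + 1)          # 1 = last in that device's queue

# Rows are jobs, columns are (device, position-from-last) slots.
slots = [(d, k) for d in devices for k in positions]
cost = np.array([[k * proc_time[j][d] for (d, k) in slots] for j in jobs])

rows, cols = linear_sum_assignment(cost)      # min-cost bipartite matching
for r, c in zip(rows, cols):
    d, k = slots[c]
    print(f"{jobs[r]} -> {d}, {k - 1} job(s) behind it on that device")
print("sum of completion times:", cost[rows, cols].sum())
```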

  • Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (EuroSys 2019) [Paper]

    • MSR India

    • Gandiva_Fair: Achieve efficiency and fairness despite cluster heterogeneity

  • Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 2019) [Paper] [Code]

    • UMich SymbioticLab

    • Relax consolidated placement constraint

  • Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]

    • MSRA

    • Hyper-parameter tuning jobs; job packing; migration; grow-shrink; time-slicing.

  • Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters (EuroSys 2018) [Paper] [Code]

    • HKU & ByteDance

    • Minimize JCT based on online resource-performance models.
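
Optimus fits resource-performance models online and adds resources where the marginal drop in estimated completion time is largest. The sketch below is a simplified version with a single resource type; the speed model, its coefficients, and the remaining-step estimates are illustrative assumptions (the paper models parameter servers and workers separately and predicts remaining work from the convergence curve).

```python
# Hypothetical fitted speed model and remaining-work estimates.

def speed(job, workers: int) -> float:
    """Modeled steps/s as a function of worker count (diminishing returns)."""
    a, b = job["speed_params"]                 # per-job fitted coefficients
    return workers / (a + b * workers)

def est_remaining_time(job, workers: int) -> float:
    """Estimated remaining steps divided by modeled training speed."""
    return float("inf") if workers == 0 else job["remaining_steps"] / speed(job, workers)

def allocate(jobs, total_workers: int) -> dict:
    """Give each next worker to the job whose estimated completion time
    drops the most (the marginal-gain heuristic)."""
    alloc = {j["name"]: 0 for j in jobs}
    for _ in range(total_workers):
        best = max(
            jobs,
            key=lambda j: est_remaining_time(j, alloc[j["name"]])
                          - est_remaining_time(j, alloc[j["name"]] + 1),
        )
        alloc[best["name"]] += 1
    return alloc

jobs = [
    {"name": "resnet", "remaining_steps": 8.0e4, "speed_params": (0.5, 0.05)},
    {"name": "lm",     "remaining_steps": 2.0e5, "speed_params": (0.4, 0.10)},
]
print(allocate(jobs, total_workers=8))
```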

  • Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments (SC 2017) [Paper] [Code]

    • Barcelona Supercomputing Center & IBM Watson Research Center

    • Consider multiple link technologies such as PCI-e and NVLink.
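
This paper and MAPA (above) both rank candidate GPU sets by their interconnect before placing a multi-GPU job: GPUs joined by NVLink are preferred over sets that communicate across PCIe. The sketch below illustrates the idea with a hypothetical 4-GPU server, made-up link types and bandwidths, and a bottleneck-bandwidth score; it is not either paper's actual policy.

```python
# Hypothetical topology and bandwidth table; a real scheduler would query the
# actual topology (e.g. via NVML / nvidia-smi topo) instead.
from itertools import combinations

# Pairwise link type between GPUs on one hypothetical 4-GPU server.
LINK = {
    frozenset({0, 1}): "nvlink_x2",   # double NVLink
    frozenset({2, 3}): "nvlink",      # single NVLink
    frozenset({0, 2}): "pcie", frozenset({0, 3}): "pcie",
    frozenset({1, 2}): "pcie", frozenset({1, 3}): "pcie",
}
BANDWIDTH_GBPS = {"nvlink_x2": 100.0, "nvlink": 50.0, "pcie": 16.0}

def score(gpu_set) -> float:
    """Bottleneck bandwidth of the set: collective communication is limited
    by the weakest pairwise link (single GPUs have no cross-GPU traffic)."""
    return min(
        (BANDWIDTH_GBPS[LINK[frozenset(p)]] for p in combinations(gpu_set, 2)),
        default=float("inf"),
    )

def place(num_gpus: int, free_gpus=(0, 1, 2, 3)):
    """Pick the free GPU set with the best interconnect for a new job."""
    return max(combinations(free_gpus, num_gpus), key=score)

print(place(2))   # -> (0, 1): the double-NVLink pair beats any PCIe pairing
```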

Scheduling for General ML Training Workloads

  • SLAQ: Quality-Driven Scheduling for Distributed Machine Learning (SoCC 2017) [Personal Notes] [Paper]

    • Princeton

    • Fine-grained job-level scheduler

    • Leverage the iterative nature of general ML training algorithms
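
SLAQ exploits the diminishing returns of iterative training: it predicts each job's near-term loss reduction and gives the next unit of resources to the job whose quality would improve the most. The sketch below shows that greedy rule with a hypothetical 1/iteration loss curve; SLAQ itself fits quality-prediction curves online from the loss values jobs report.

```python
# Minimal sketch of quality-driven allocation. The loss curve loss(i) ~ c / i
# and its parameters are hypothetical stand-ins for SLAQ's fitted predictors.

def marginal_loss_drop(job) -> float:
    """Predicted extra loss reduction from one more resource unit, assuming
    each extra unit buys roughly one more iteration of progress."""
    i = job["iterations_done"] + job["alloc"]
    return job["c"] / i - job["c"] / (i + 1)

def allocate(jobs, total_units: int) -> dict:
    """Greedy core: hand out units one at a time to the job whose quality
    (loss) is predicted to improve the most."""
    for j in jobs:
        j["alloc"] = 0
    for _ in range(total_units):
        best = max(jobs, key=marginal_loss_drop)
        best["alloc"] += 1
    return {j["name"]: j["alloc"] for j in jobs}

jobs = [
    {"name": "old-lr",  "c": 50.0, "iterations_done": 400},  # nearly converged
    {"name": "new-svm", "c": 80.0, "iterations_done": 20},   # still improving fast
]
# In this example, the units flow to the fast-improving job.
print(allocate(jobs, total_units=6))
```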

Trace Analysis

  • MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 2022) [Paper] [Trace]

    • HKUST & Alibaba

    • GPU sharing traces

  • Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (SC 2021) [Paper] [Trace]

    • NTU & SenseTime

  • Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (ATC 2019) [Paper] [Trace]

    • MSR

  • Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019) [Paper]

    • Alibaba PAI

Survey

  • Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (arXiv 2205.11913) [Paper] [Paper List]

    • NTU & PKU & SenseTime

Acronyms

  • DL: Deep Learning

  • ML: Machine Learning
