Resource Scheduler

I am actively maintaining this list.

Scheduling for DL Training Workloads

  • Blox: A Modular Toolkit for Deep Learning Schedulers (EuroSys 2024) [arXiv] [Code]

    • UW-Madison & MSR

  • Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach (SC 2023) [Personal Notes] [Paper] [Code]

    • UMacau & SIAT, CAS

    • IADeep: a cluster scheduler to co-locate DL training tasks

    • Tune training configurations (e.g., batch size) across all co-located tasks; choose appropriate tasks to multiplex on a GPU device; consider PCIe bandwidth

  • Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling (SOSP 2023) [Paper]

    • CMU & Cornell & Petuum Inc.

  • Lyra: Elastic Scheduling for Deep Learning Clusters (EuroSys 2023) [Personal Notes] [Paper] [arXiv]

    • ByteDance & CityU & CUHK

    • Loan idle inference GPU servers to elastic training jobs.

  • Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning (NSDI 2023) [Personal Notes] [Paper] [Code]

    • UW-Madison & UT-Austin

    • Elastic resource requirements; extend market theory.

  • Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs (ASPLOS 2023) [Personal Notes] [Paper] [Code]

    • NTU & Shanghai AI Lab & SenseTime

    • Scheduling interpretability

  • Multi-Resource Interleaving for Deep Learning Training (SIGCOMM 2022) [Personal Notes] [Paper] [Code]

    • PKU & ByteDance

    • Muri: Pack jobs along multiple resource types in the time dimension (see the sketch below this entry)

    • Integrate with PyTorch
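
Muri packs jobs whose iterations stress different resources (GPU, CPU, network, storage) at different times, so their stages can be interleaved on the same devices. The sketch below estimates how well a candidate group would interleave; the stage profiles and the efficiency formula are simplified illustrations, not Muri's actual grouping algorithm.

```python
# Hypothetical per-iteration stage profiles; a real scheduler would profile
# these online. RESOURCES and interleave_efficiency are illustrative names.

RESOURCES = ("gpu", "cpu", "network", "storage")

def interleave_efficiency(jobs):
    """jobs: list of dicts mapping resource -> per-iteration stage time (s).

    When k jobs are interleaved on the same devices, each resource serves the
    k stages back-to-back, so one interleaved round is bounded below by the
    busiest resource and by the slowest single job. Efficiency compares that
    bound against running the jobs with no overlap at all.
    """
    k = len(jobs)
    solo_total = sum(sum(job[r] for r in RESOURCES) for job in jobs)
    interleaved_round = max(
        max(sum(job[r] for job in jobs) for r in RESOURCES),  # busiest resource
        max(sum(job[r] for r in RESOURCES) for job in jobs),  # slowest single job
    )
    return solo_total / (k * interleaved_round)

# A GPU-heavy job and a CPU/IO-heavy job interleave almost perfectly ...
a = {"gpu": 0.80, "cpu": 0.10, "network": 0.05, "storage": 0.05}
b = {"gpu": 0.10, "cpu": 0.50, "network": 0.20, "storage": 0.20}
print(interleave_efficiency([a, b]))   # ~1.0

# ... while two GPU-bound jobs mostly queue behind each other on the GPU.
print(interleave_efficiency([a, a]))   # ~0.63
```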

  • Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads (arXiv 2202.07848) [Personal Notes] [Paper]

    • Microsoft

    • Live GPU job migration

  • Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters (OSDI 2022) [Personal Notes] [Paper] [Code]

    • MSR & UT-Austin & VMware Research

    • Consider the allocation of CPU and memory resources.

  • Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 2021) [Personal Notes] [Paper] [Code]

    • Petuum & CMU

    • Best Paper Award

    • Co-adaptively allocates resources (number of GPUs) and tunes the hyperparameters (batch size and learning rate) for all DL training jobs.
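
Pollux's key metric is goodput: system throughput (examples/s at a given GPU count and batch size) multiplied by statistical efficiency (how much each example contributes to training progress at that batch size). Below is a minimal sketch of the co-adaptation idea; the throughput and efficiency models, their constants, and the candidate batch sizes are illustrative assumptions, not the paper's fitted models.

```python
# Hypothetical models; Pollux fits both online from metrics it observes
# during training.

def throughput(num_gpus: int, batch_size: int) -> float:
    """Examples/s for a data-parallel job: per-step time grows with the
    per-GPU batch (compute) and with the GPU count (gradient sync)."""
    local_bs = batch_size / num_gpus
    step_time = 0.05 + local_bs / 2000.0 + 0.01 * num_gpus
    return batch_size / step_time

def statistical_efficiency(batch_size: int, gradient_noise_scale: float = 1024.0) -> float:
    """Progress per example relative to a small batch, in the spirit of the
    gradient-noise-scale argument: very large batches waste examples."""
    return 1.0 / (1.0 + batch_size / gradient_noise_scale)

def goodput(num_gpus: int, batch_size: int) -> float:
    return throughput(num_gpus, batch_size) * statistical_efficiency(batch_size)

def co_adapt(num_gpus: int, candidates=(128, 256, 512, 1024, 2048, 4096)) -> int:
    """For a proposed GPU allocation, re-tune the batch size for goodput
    rather than raw throughput (the scheduler searches allocations; each job
    re-tunes its own knobs)."""
    return max(candidates, key=lambda bs: goodput(num_gpus, bs))

# The chosen batch size grows with the allocation, but only while the
# statistical-efficiency penalty is worth paying.
for g in (1, 4, 16):
    bs = co_adapt(g)
    print(f"{g:>2} GPUs -> batch size {bs:>4}, goodput ~{goodput(g, bs):,.0f} examples/s")
```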

  • MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers (SC 2021) [Paper] [Code]

    • UC Riverside & Pacific Northwest National Lab & USydney

    • Consider multi-GPU accelerator topologies such as single/double NVLink.

  • Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2021) [Paper]

    • PKU & NTU & SenseTime

    • Long-term GPU-time fairness

  • AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 2020) [Paper] [Code]

    • Alibaba

    • Co-locate resource-guarantee and best-effort jobs.

  • HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees (OSDI 2020) [Personal Notes] [Paper] [Code]

    • MSRA

    • Virtual private clusters; resource isolation and management for multi-tenant clusters.

  • Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads (OSDI 2020) [Paper] [Code]

    • MSR & Stanford

    • Gavel: Consider performance heterogeneity across multiple accelerator types.

  • Themis: Fair and Efficient GPU Cluster Scheduling (EuroSys 2020) [Paper]

    • UW-Madison & MSR

    • Long-term fairness

  • AlloX: Compute Allocation in Hybrid Clusters (EuroSys 2020) [Paper] [Code]

    • Stony Brook University & SUNY Korea & UMich

    • CPU-GPU hybrid clusters; min-cost bipartite matching.
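
AlloX estimates each job's processing time on CPU vs. GPU and solves a min-cost bipartite matching between jobs and (device, queue-position) slots to minimize average completion time. The sketch below shows that matching step with scipy's linear_sum_assignment; the processing-time estimates are made up and the cost construction is a simplification of the paper's formulation.

```python
# A job placed k-th from the end of a device's queue delays itself and the
# k-1 jobs behind it, so its contribution to the sum of completion times is
# k * (its processing time on that device).
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical estimates: seconds to finish each job on a CPU node vs. a GPU.
proc_time = {
    "job0": {"cpu": 300.0, "gpu": 40.0},
    "job1": {"cpu": 120.0, "gpu": 90.0},
    "job2": {"cpu": 60.0,  "gpu": 50.0},
    "job3": {"cpu": 500.0, "gpu": 45.0},
}
devices = ["cpu", "gpu"]
jobs = list(proc_time)
positions = range(1, len(jobs) + 1)          # 1 = last in that device's queue

# Rows are jobs, columns are (device, position-from-last) slots.
slots = [(d, k) for d in devices for k in positions]
cost = np.array([[k * proc_time[j][d] for (d, k) in slots] for j in jobs])

rows, cols = linear_sum_assignment(cost)      # min-cost bipartite matching
for r, c in zip(rows, cols):
    d, k = slots[c]
    print(f"{jobs[r]} -> {d}, {k - 1} job(s) behind it on that device")
print("sum of completion times:", cost[rows, cols].sum())
```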

  • Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (EuroSys 2019) [Paper]

    • MSR India

    • Gandiva_Fair: Achieve efficiency and fairness despite cluster heterogeneity

  • Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 2019) [Paper] [Code]

    • UMich SymbioticLab

    • Relax consolidated placement constraint

  • Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]

    • MSRA

    • Hyper-parameter tuning jobs; job packing; migration; grow-shrink; time-slicing.

  • Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters (EuroSys 2018) [Paper] [Code]

    • HKU & ByteDance

    • Minimize JCT based on online resource-performance models.
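
Optimus fits resource-performance models online and adds resources where the marginal drop in estimated completion time is largest. The sketch below is a simplified version with a single resource type; the speed model, its coefficients, and the remaining-step estimates are illustrative assumptions (the paper models parameter servers and workers separately and predicts remaining work from the convergence curve).

```python
# Hypothetical fitted speed model and remaining-work estimates.

def speed(job, workers: int) -> float:
    """Modeled steps/s as a function of worker count (diminishing returns)."""
    a, b = job["speed_params"]                 # per-job fitted coefficients
    return workers / (a + b * workers)

def est_remaining_time(job, workers: int) -> float:
    """Estimated remaining steps divided by modeled training speed."""
    return float("inf") if workers == 0 else job["remaining_steps"] / speed(job, workers)

def allocate(jobs, total_workers: int) -> dict:
    """Give each next worker to the job whose estimated completion time
    drops the most (the marginal-gain heuristic)."""
    alloc = {j["name"]: 0 for j in jobs}
    for _ in range(total_workers):
        best = max(
            jobs,
            key=lambda j: est_remaining_time(j, alloc[j["name"]])
                          - est_remaining_time(j, alloc[j["name"]] + 1),
        )
        alloc[best["name"]] += 1
    return alloc

jobs = [
    {"name": "resnet", "remaining_steps": 8.0e4, "speed_params": (0.5, 0.05)},
    {"name": "lm",     "remaining_steps": 2.0e5, "speed_params": (0.4, 0.10)},
]
print(allocate(jobs, total_workers=8))
```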

  • Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments (SC 2017) [Paper] [Code]

    • Barcelona Supercomputing Center & IBM Watson Research Center

    • Consider multiple link technologies such as PCI-e and NVLink.
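
This paper and MAPA (above) both rank candidate GPU sets by their interconnect before placing a multi-GPU job: GPUs joined by NVLink are preferred over sets that communicate across PCIe. The sketch below illustrates the idea with a hypothetical 4-GPU server, made-up link types and bandwidths, and a bottleneck-bandwidth score; it is not either paper's actual policy.

```python
# Hypothetical topology and bandwidth table; a real scheduler would query the
# actual topology (e.g. via NVML / nvidia-smi topo) instead.
from itertools import combinations

# Pairwise link type between GPUs on one hypothetical 4-GPU server.
LINK = {
    frozenset({0, 1}): "nvlink_x2",   # double NVLink
    frozenset({2, 3}): "nvlink",      # single NVLink
    frozenset({0, 2}): "pcie", frozenset({0, 3}): "pcie",
    frozenset({1, 2}): "pcie", frozenset({1, 3}): "pcie",
}
BANDWIDTH_GBPS = {"nvlink_x2": 100.0, "nvlink": 50.0, "pcie": 16.0}

def score(gpu_set) -> float:
    """Bottleneck bandwidth of the set: collective communication is limited
    by the weakest pairwise link (single GPUs have no cross-GPU traffic)."""
    return min(
        (BANDWIDTH_GBPS[LINK[frozenset(p)]] for p in combinations(gpu_set, 2)),
        default=float("inf"),
    )

def place(num_gpus: int, free_gpus=(0, 1, 2, 3)):
    """Pick the free GPU set with the best interconnect for a new job."""
    return max(combinations(free_gpus, num_gpus), key=score)

print(place(2))   # -> (0, 1): the double-NVLink pair beats any PCIe pairing
```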

Scheduling for General ML Training Workloads

  • SLAQ: Quality-Driven Scheduling for Distributed Machine Learning (SoCC 2017) [Personal Notes] [Paper]

    • Princeton

    • Fine-grained job-level scheduler

    • Leverage the iterative nature of general ML training algorithms
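
SLAQ exploits the diminishing returns of iterative training: it predicts each job's near-term loss reduction and gives the next unit of resources to the job whose quality would improve the most. The sketch below shows that greedy rule with a hypothetical 1/iteration loss curve; SLAQ itself fits quality-prediction curves online from the loss values jobs report.

```python
# Minimal sketch of quality-driven allocation. The loss curve loss(i) ~ c / i
# and its parameters are hypothetical stand-ins for SLAQ's fitted predictors.

def marginal_loss_drop(job) -> float:
    """Predicted extra loss reduction from one more resource unit, assuming
    each extra unit buys roughly one more iteration of progress."""
    i = job["iterations_done"] + job["alloc"]
    return job["c"] / i - job["c"] / (i + 1)

def allocate(jobs, total_units: int) -> dict:
    """Greedy core: hand out units one at a time to the job whose quality
    (loss) is predicted to improve the most."""
    for j in jobs:
        j["alloc"] = 0
    for _ in range(total_units):
        best = max(jobs, key=marginal_loss_drop)
        best["alloc"] += 1
    return {j["name"]: j["alloc"] for j in jobs}

jobs = [
    {"name": "old-lr",  "c": 50.0, "iterations_done": 400},  # nearly converged
    {"name": "new-svm", "c": 80.0, "iterations_done": 20},   # still improving fast
]
# In this example, the units flow to the fast-improving job.
print(allocate(jobs, total_units=6))
```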

Trace Analysis

  • MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 2022) [Paper] [Trace]

    • HKUST & Alibaba

    • GPU sharing traces

  • Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (SC 2021) [Paper] [Trace]

    • NTU & SenseTime

  • Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (ATC 2019) [Paper] [Trace]

    • MSR

  • Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019) [Paper]

    • Alibaba PAI

Survey

  • Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (arXiv 2205.11913) [Paper] [Paper List]

    • NTU & PKU & SenseTime

Acronyms

  • DL: Deep Learning

  • ML: Machine Learning
