Resource Scheduler
CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters
MIT & UT-Austin
Consider the communication pattern of different jobs while placing them on network links.
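A minimal sketch of that intuition, assuming each job's traffic on a shared link can be modeled as a periodic on/off pattern (the patterns, helper names, and brute-force shift search are illustrative; the paper itself uses an affinity graph and a geometric abstraction to find compatible time shifts):

```python
# Hypothetical helper: fraction of slots where both jobs burst on the
# shared link, with job b's iteration shifted in time by `shift` slots.
def overlap(a, b, shift):
    n = len(a)
    return sum(a[i] and b[(i + shift) % n] for i in range(n)) / n

# Pick the relative shift that best interleaves the two jobs' bursts.
def best_shift(a, b):
    return min(range(len(a)), key=lambda s: overlap(a, b, s))

# Two jobs that each occupy the link for half of every iteration.
job_a = [1, 1, 0, 0]
job_b = [1, 1, 0, 0]
s = best_shift(job_a, job_b)
print(s, overlap(job_a, job_b, s))  # shift 2 -> zero overlap
```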
Blox: A Modular Toolkit for Deep Learning Schedulers
UW-Madison & MSR
Interference-aware Multiplexing for Deep Learning in GPU Clusters: A Middleware Approach
UMacau & SIAT, CAS
IADeep: a cluster scheduler that co-locates DL training tasks
Tune training configurations (e.g., batch size) across all co-located tasks, choose appropriate tasks to multiplex on a GPU device, and account for PCIe bandwidth (see the sketch below)
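A toy sketch of the co-tuning step, assuming hypothetical per-task throughput and memory models (the model forms, constants, and candidate batch sizes are illustrative, not IADeep's actual models):

```python
import itertools

MEM_BUDGET = 16.0  # GB, assumed device memory

def throughput(task, bs):
    # Hypothetical diminishing-returns throughput model.
    return task["peak"] * bs / (bs + task["half"])

def mem(task, bs):
    # Hypothetical linear memory model per co-located task.
    return task["mem_per_sample"] * bs + task["mem_fixed"]

def co_tune(tasks, choices=(8, 16, 32, 64)):
    """Jointly pick batch sizes that fit in memory and maximize throughput."""
    best, best_cfg = -1.0, None
    for cfg in itertools.product(choices, repeat=len(tasks)):
        if sum(mem(t, b) for t, b in zip(tasks, cfg)) > MEM_BUDGET:
            continue
        total = sum(throughput(t, b) for t, b in zip(tasks, cfg))
        if total > best:
            best, best_cfg = total, cfg
    return best_cfg

tasks = [
    {"peak": 100.0, "half": 16, "mem_per_sample": 0.1, "mem_fixed": 2.0},
    {"peak": 80.0,  "half": 32, "mem_per_sample": 0.2, "mem_fixed": 1.0},
]
print(co_tune(tasks))  # e.g., (64, 32)
```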
Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
CMU & Cornell & Petuum Inc.
Lyra: Elastic Scheduling for Deep Learning Clusters
ByteDance & CityU & CUHK
Loan idle inference GPU servers for elastic training jobs.
Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning
UW-Madison & UT-Austin
Elastic resource requirements; extend market theory.
Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs
NTU & Shanghai AI Lab & SenseTime
Scheduling interpretability
Multi-Resource Interleaving for Deep Learning Training
PKU & ByteDance
Muri: pack jobs along multiple resource types in the time dimension (see the sketch below)
Integrated with PyTorch
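A minimal sketch of the interleaving intuition: if each job's iteration cycles through stages that stress different resources, staggering the phases of k packed jobs keeps every resource busy in every time slot (the resource list and equal-length stages are simplifying assumptions; Muri's real packing handles unequal stage durations):

```python
RESOURCES = ["storage", "cpu", "gpu", "network"]

def interleaved_schedule(jobs, slots):
    """jobs: list of job names; returns {slot: {resource: job}}."""
    k = len(RESOURCES)
    plan = {}
    for t in range(slots):
        # Rotate each job to the next stage; (i + t) % k is a bijection
        # over jobs, so every resource runs exactly one job per slot.
        plan[t] = {RESOURCES[(i + t) % k]: job
                   for i, job in enumerate(jobs[:k])}
    return plan

for t, assignment in interleaved_schedule(["A", "B", "C", "D"], 4).items():
    print(t, assignment)
```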
Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads (arXiv 2202.07848)
Microsoft
Live GPU job migration
Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters
MSR & UT-Austin & VMware Research
Consider the allocation of CPU and memory resources.
Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning
Petuum & CMU
Best Paper Award
Co-adaptively allocates resources (number of GPUs) and tunes the hyperparameters (batch size and learning rate) for all DL training jobs.
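Pollux's objective is goodput = system throughput × statistical efficiency, searched jointly over GPU count and batch size. A rough sketch with stand-in models (the formulas and constants below are illustrative; Pollux fits its models online from per-job metrics):

```python
def throughput(gpus, batch, t_grad=0.1, t_sync=0.05):
    # Hypothetical system model: compute scales with GPUs, sync adds overhead.
    step_time = t_grad * batch / gpus + t_sync * gpus ** 0.5
    return batch / step_time  # samples per second

def stat_efficiency(batch, base_batch=32, noise_scale=256):
    # Simplified gradient-noise-scale model: large batches waste samples.
    return (noise_scale + base_batch) / (noise_scale + batch)

def goodput(gpus, batch):
    return throughput(gpus, batch) * stat_efficiency(batch)

# Search a small grid of (GPUs, batch size) for the best goodput.
best = max(((g, m) for g in (1, 2, 4, 8) for m in (32, 64, 128, 256, 512)),
           key=lambda gm: goodput(*gm))
print(best, goodput(*best))
```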
MAPA: Multi-Accelerator Pattern Allocation Policy for Multi-Tenant GPU Servers (SC 2021)
UC Riverside & Pacific Northwest National Lab & USydney
Consider multi-GPU accelerator topologies such as single/double NVLink.
Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters (TPDS 2021)
PKU & NTU & SenseTime
Long-term GPU-time fairness
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
Alibaba
Co-locate resource-guarantee and best-effort jobs.
HiveD: Sharing a GPU Cluster for Deep Learning with Guarantees
MSRA
Virtual private clusters; resource isolation and management for multi-tenant clusters.
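HiveD expresses each tenant's quota as multi-level "cells" (GPU, PCIe switch, CPU socket, node, ...) rather than raw GPU counts, so affinity guarantees survive other tenants' fragmentation. A buddy-allocator-style sketch of cell allocation (levels and the splitting rule are simplified):

```python
class CellPool:
    def __init__(self, top_level):
        # free[k] = number of free level-k cells (2**k GPUs each)
        self.free = {k: 0 for k in range(top_level + 1)}
        self.free[top_level] = 1

    def alloc(self, level):
        """Take a level-`level` cell, splitting a larger free cell if needed."""
        k = level
        while k <= max(self.free) and self.free[k] == 0:
            k += 1
        if k > max(self.free):
            return False  # no cell large enough
        while k > level:          # split one level down: 1 cell -> 2 buddies
            self.free[k] -= 1
            self.free[k - 1] += 2
            k -= 1
        self.free[level] -= 1
        return True

pool = CellPool(top_level=3)      # one 8-GPU node
print(pool.alloc(1), pool.free)   # take a 2-GPU cell; a 4-GPU and a 2-GPU cell remain free
```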
Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
MSR & Stanford
Gavel: Consider performance heterogeneity across multiple accelerator types.
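Gavel casts each scheduling policy as an optimization problem over a job × accelerator-type throughput matrix. A tiny brute-force rendering for whole-GPU assignment under a max-min fairness objective (the throughput numbers are invented; Gavel solves the general time-fractional allocation problem):

```python
from itertools import permutations

T = {  # measured samples/sec per accelerator type (illustrative numbers)
    "resnet":      {"V100": 400, "K80": 100},
    "transformer": {"V100": 900, "K80": 150},
}
gpus = ["V100", "K80"]
jobs = list(T)

def normalized(job, gpu):
    # Throughput normalized to the job's best accelerator.
    return T[job][gpu] / max(T[job].values())

# Max-min fairness: maximize the worst job's normalized throughput.
best = max(permutations(gpus),
           key=lambda perm: min(normalized(j, g) for j, g in zip(jobs, perm)))
print(dict(zip(jobs, best)))  # {'resnet': 'K80', 'transformer': 'V100'}
```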
Themis: Fair and Efficient GPU Cluster Scheduling
UW-Madison & MSR
Long-term fairness via a finish-time fairness metric (see the sketch below)
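Themis defines finish-time fairness ρ = T_shared / T_independent and drives ρ toward 1 by repeatedly offering resources to the worst-off jobs. A simplified sketch of the metric and the offer step (field names and the offer fraction are illustrative; the real system runs a partial-allocation auction among the selected jobs):

```python
def rho(job):
    # T_shared: projected finish time at the job's current allocation;
    # T_independent: finish time if it owned its fair 1/N cluster share.
    return job["t_shared"] / job["t_independent"]

def next_offer(jobs, fraction=0.5):
    """Offer resources to the worst-off fraction of jobs (highest rho)."""
    ranked = sorted(jobs, key=rho, reverse=True)
    return ranked[: max(1, int(len(ranked) * fraction))]

jobs = [
    {"name": "a", "t_shared": 90.0, "t_independent": 60.0},  # rho = 1.5
    {"name": "b", "t_shared": 50.0, "t_independent": 50.0},  # rho = 1.0
]
print([j["name"] for j in next_offer(jobs)])  # ['a']
```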
AlloX: Compute Allocation in Hybrid Clusters
Stony Brook University & SUNY Korea & UMich
CPU-GPU hybrid clusters; min-cost bipartite matching.
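Since jobs speed up very differently on CPUs vs. GPUs, AlloX casts placement as min-cost bipartite matching. A minimal sketch using SciPy's assignment solver (the costs are invented; the paper matches jobs to (device, queue-position) slots to minimize average completion time):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = estimated completion time of job i on device j
cost = np.array([
    [10.0, 2.0],   # job 0: much faster on the GPU
    [4.0,  3.0],   # job 1: only mildly faster on the GPU
])
rows, cols = linear_sum_assignment(cost)  # min-cost perfect matching
print(list(zip(rows, cols)), cost[rows, cols].sum())  # job 0 -> GPU, job 1 -> CPU
```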
Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning (EuroSys 2020)
MSR India
Gandiva_fair: achieve efficiency and fairness despite cluster heterogeneity
Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 2019)
UMich SymbioticLab
Relax the consolidated placement constraint; prioritize jobs by two-dimensional attained service, i.e., GPUs × time (see the sketch below)
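A sketch of the discretized two-dimensional attained-service idea: jobs demote to lower-priority queues as their GPU-time product crosses thresholds, so short jobs finish quickly without knowing job durations in advance (the thresholds and job fields are illustrative):

```python
QUEUE_THRESHOLDS = [1_000, 10_000]  # GPU-seconds, assumed values

def queue_index(job):
    service = job["gpus"] * job["runtime_s"]  # 2D attained service
    for i, limit in enumerate(QUEUE_THRESHOLDS):
        if service < limit:
            return i
    return len(QUEUE_THRESHOLDS)  # lowest-priority queue

def pick_next(jobs):
    # Higher queues (smaller index) run first; FIFO within a queue.
    return min(jobs, key=lambda j: (queue_index(j), j["arrival"]))

jobs = [
    {"name": "old", "gpus": 8, "runtime_s": 5_000, "arrival": 0},  # queue 2
    {"name": "new", "gpus": 2, "runtime_s": 100,   "arrival": 7},  # queue 0
]
print(pick_next(jobs)["name"])  # 'new'
```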
Gandiva: Introspective Cluster Scheduling for Deep Learning
MSRA
Hyper-parameter tuning jobs; job packing; migration; grow-shrink; time-slicing.
Optimus: An Efficient Dynamic Resource Scheduler for Deep Learning Clusters (EuroSys 2018)
HKU & ByteDance
Minimize JCT based on online resource-performance models.
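A sketch of the greedy step behind that idea, with a hypothetical fitted performance model: each spare worker goes to the job whose predicted remaining time drops the most (the model form and constants are invented; Optimus fits its model online and also places parameter servers):

```python
def speed(workers, a=1.0, b=0.1):
    # Hypothetical fitted model: throughput with diminishing returns.
    return workers / (a + b * workers)

def remaining_time(job, workers):
    return job["remaining_epochs"] * job["steps_per_epoch"] / speed(workers)

def allocate(jobs, total_workers):
    alloc = {j["name"]: 1 for j in jobs}  # at least one worker each
    for _ in range(total_workers - len(jobs)):
        # Marginal JCT reduction from one extra worker, per job.
        gains = {
            j["name"]: remaining_time(j, alloc[j["name"]])
                     - remaining_time(j, alloc[j["name"]] + 1)
            for j in jobs
        }
        winner = max(gains, key=gains.get)
        alloc[winner] += 1
    return alloc

jobs = [
    {"name": "big",   "remaining_epochs": 50, "steps_per_epoch": 100},
    {"name": "small", "remaining_epochs": 5,  "steps_per_epoch": 100},
]
print(allocate(jobs, total_workers=8))
```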
Topology-Aware GPU Scheduling for Learning Workloads in Cloud Environments (SC 2017)
Barcelona Supercomputing Center & IBM Watson Research Center
Consider multiple link technologies such as PCIe and NVLink.
SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Princeton
Fine-grained job-level scheduler
Leverage the iterative nature of general ML training algorithms
MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters
HKUST & Alibaba
GPU sharing traces
Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters (SC 2021)
NTU & SenseTime
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (ATC 2019)
MSR
Characterizing Deep Learning Training Workloads on Alibaba-PAI (IISWC 2019)
Alibaba PAI
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision (arXiv 2205.11913)
NTU & PKU & SenseTime
DL: Deep Learning
ML: Machine Learning