Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning
Metadata
Presented in arXiv:2210.00093 (accepted at NSDI 2023).
Authors: Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella, University of Wisconsin-Madison, The University of Texas at Austin
Code: https://github.com/uw-mad-dash/shockwave
Understanding the paper
TL;DRs
This paper presents Shockwave, an efficient and fair scheduler for DL training jobs with elastic resource requirements. It extends classic market theory from static settings to dynamic settings to co-optimize efficiency and fairness, and uses stochastic dynamic programming to handle dynamic changes.
Limitations of Existing Work
Dynamically adjusting model structure or hyper-parameters (e.g., batch size) can significantly accelerate training without sacrificing accuracy.
Existing schedulers fail to provide fairness and degrade system efficiency when training throughput changes over time under dynamic adaptation.
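To illustrate why dynamic adaptation breaks schedulers that assume static throughput, here is a minimal Python sketch (the doubling schedule and milestone epochs are hypothetical, not taken from the paper): a job that doubles its batch size at fixed milestones has time-varying throughput, so a scheduler that profiles it once will misestimate its remaining work.

```python
# Hypothetical dynamic batch-size adaptation: the job doubles its batch
# size at fixed epoch milestones, so its throughput -- and therefore its
# resource demand and utility per GPU-hour -- changes mid-training.

def adapted_batch_sizes(total_epochs, base_batch=32, milestones=(10, 20)):
    """Return the batch size used at each epoch under a doubling schedule."""
    sizes = []
    batch = base_batch
    for epoch in range(total_epochs):
        if epoch in milestones:  # adaptation point: double the batch size
            batch *= 2
        sizes.append(batch)
    return sizes

sizes = adapted_batch_sizes(30)
# A scheduler that profiles this job at epoch 0 sees batch size 32 and
# extrapolates a constant throughput, underestimating later progress.
```

The same argument applies to any adaptation (model structure, learning-rate-driven early stopping) that shifts a job's throughput profile after initial profiling.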
Assumptions
Dynamic adaptation is treated as part of the user’s program and cannot be modified by the scheduler.
This is based on the authors’ measurements: automatically performing dynamic adaptation can hurt training accuracy.
Challenges to Extend Market Theory
The market needs to know future utility values to compute the market equilibrium, but future utility is hard to predict since dynamic adaptations in jobs are triggered non-deterministically.
Solution: Use Bayesian statistics to predict adaptation patterns.
Solving the dynamic market equilibrium over an (infinitely) long time horizon is difficult and impractical.
Solution: Plan the schedule only for a finite window (e.g., 30-60 minutes).
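The finite-window idea can be sketched as a rolling-horizon loop. The proportional-share planner below is a toy stand-in (not Shockwave's dynamic-market solver, and all names are hypothetical): plan allocations for a fixed window, execute the window, then re-plan with refreshed estimates of each job's remaining work.

```python
# Rolling-horizon planning sketch: rather than solving for an equilibrium
# over an infinite horizon, repeatedly solve a small finite-window problem.

def plan_window(jobs, window_rounds):
    """Toy planner: split a window's rounds proportionally to remaining work."""
    total = sum(j["remaining"] for j in jobs) or 1
    return {j["name"]: round(window_rounds * j["remaining"] / total)
            for j in jobs}

def rolling_schedule(jobs, window_rounds=4, max_windows=10):
    """Plan one window at a time, re-planning as jobs make progress."""
    history = []
    for _ in range(max_windows):
        active = [j for j in jobs if j["remaining"] > 0]
        if not active:
            break
        plan = plan_window(active, window_rounds)  # finite-window solve
        history.append(plan)
        for j in jobs:                             # "execute" the window
            j["remaining"] = max(0, j["remaining"] - plan.get(j["name"], 0))
    return history
```

In Shockwave the per-window solve is a dynamic-market equilibrium informed by the Bayesian throughput predictions; the re-planning loop is what keeps the horizon tractable.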
Evaluation
Baselines
OSSP (Open Shop Scheduling)
AlloX
Themis
MSS (Max-Sum-Throughput)
GandivaFair
Pollux
Compared with existing fair schedulers, Shockwave improves makespan by 1.3x and fairness by 2x.