# Shockwave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning

## Metadata

Presented in [arxiv:2210.00093](https://arxiv.org/abs/2210.00093); accepted at NSDI 2023.

Authors: Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, Aditya Akella, *University of Wisconsin-Madison, The University of Texas at Austin*

Code: <https://github.com/uw-mad-dash/shockwave>

## Understanding the paper

### TL;DRs

This paper presents **Shockwave**, an efficient and *fair* scheduler for *DL training jobs* with *elastic resource requirements*.\
It *extends classic market theory from static to dynamic settings* to co-optimize efficiency and fairness, and *uses stochastic dynamic programming* to handle dynamic changes.

### Limitations of Existing Work

* Dynamically adjusting *model structure* or *hyper-parameters* (e.g., batch size) can significantly *accelerate training* without *sacrificing accuracy*.
* Existing schedulers *fail to provide fairness and degrade system efficiency* when *training throughput changes over time* under dynamic adaptation.

### Assumptions

* Dynamic adaptation is treated as part of the user’s program and cannot be modified by the scheduler.
  * Based on their measurements: automatically performing dynamic adaptation can hurt training accuracy.

### Challenges to Extend Market Theory

* The market needs future *utility values* to compute the *market equilibrium*, but future utility is hard to predict because dynamic adaptations are triggered non-deterministically by jobs.
  * Solution: Use *Bayesian statistics* to predict adaptation patterns.
* Solving the dynamic market equilibrium over an (infinitely) long time horizon is difficult and impractical.
  * Solution: Only plan the schedule over *a finite window* (e.g., 30-60 minutes).
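The two solutions above can be sketched together: maintain a Bayesian posterior over whether a job will trigger dynamic adaptation (e.g., rescale its batch size), then use the posterior mean to compute an expected utility over a finite planning window. This is a minimal illustrative sketch, not Shockwave's actual implementation; the class/function names and the Beta-Bernoulli model choice are assumptions made here for clarity.

```python
from dataclasses import dataclass


@dataclass
class AdaptationPredictor:
    """Beta-Bernoulli posterior over whether a job triggers dynamic
    adaptation in a future epoch (illustrative, not Shockwave's code)."""
    alpha: float = 1.0  # prior pseudo-count of "adapted" epochs
    beta: float = 1.0   # prior pseudo-count of "did not adapt" epochs

    def observe(self, adapted: bool) -> None:
        # Standard conjugate update from an observed epoch.
        if adapted:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def prob_adapt(self) -> float:
        # Posterior mean of the per-epoch adaptation probability.
        return self.alpha / (self.alpha + self.beta)


def expected_utility(throughput_now: float, throughput_after: float,
                     p_adapt: float, horizon: int) -> float:
    """Expected utility over a finite planning window of `horizon` epochs,
    mixing pre- and post-adaptation throughput by the predicted probability."""
    return sum(
        (1.0 - p_adapt) * throughput_now + p_adapt * throughput_after
        for _ in range(horizon)
    )


if __name__ == "__main__":
    pred = AdaptationPredictor()
    for adapted in [False, False, True, False]:
        pred.observe(adapted)
    p = pred.prob_adapt()  # 2 "adapted" counts out of 6 total -> ~0.33
    print(expected_utility(100.0, 150.0, p, horizon=4))
```

The finite horizon keeps the planning problem tractable: instead of solving for equilibrium over the whole (unknown) future, the scheduler only needs expected utilities over the next window and can re-plan as new adaptation observations arrive.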

### Evaluation

* Baselines
  * OSSP (Open Shop Scheduling)
  * AlloX
  * Themis
  * MSS (Max-Sum-Throughput)
  * GandivaFair
  * Pollux
* Improves makespan by 1.3x and fairness by 2x compared with existing *fair schedulers*.
