# Pollux: Co-adaptive cluster scheduling for goodput-optimized deep learning

## Metadata

Presented in [OSDI 2021](https://www.usenix.org/conference/osdi21/presentation/qiao).

Authors: Aurick Qiao, Sang Keun Choe, Suhas Jayaram Subramanya, Willie Neiswanger, Qirong Ho, Hao Zhang, Gregory R. Ganger, Eric P. Xing

Code (AdaptDL): <https://github.com/petuum/adaptdl>

## Understanding the paper

### TL;DR

This paper presents a deep learning cluster scheduler named **Pollux**, which co-adaptively **allocates resources** (the number of GPUs) and **tunes the hyperparameters** (the batch size and learning rate) for all **DL training jobs** in a shared cluster.

### Background

* The running time of each training iteration can be divided into two main components.
  * $$\text{T}\_{\text{grad}}$$: the time spent computing the gradient.
  * $$\text{T}\_{\text{sync}}$$: the time spent synchronizing across all GPUs.
    * Collective all-reduce => average gradient
    * Parameter servers => synchronize weight
* GNS (noise-to-signal ratio of the stochastic gradient)
  * A larger GNS => higher batch size or learning rate with less reduction of statistical efficiency
  * Vary greatly between different DL models
  * Non-constant, gradually increase during training
* Existing DL schedulers
  * Non-scale-adaptive: Tiresias, Gandiva => require users to specify the number of GPUs
  * Scale-adaptive: Optimus, SLAQ, Gavel, Antman, Themis

### Key designs

* Different number of GPUs, different stage of training => different best batch size => larger batch sizes can be more useful later in training
* Propose a formulation of **goodput** for DL jobs, which combines system throughput with model statistical efficiency.
* Focus on three configuration parameters
  * The number of GPUs
  * Per-GPU batch size
  * Number of gradient accumulation steps
* Adaptively co-optimize inter-dependent factors: 1) at the **per-job** level; 2) at the **cluster-wide** level.

### Evaluation

Compared to SOTA DL schedulers,

* reduce average job completion times (JCT)
* promote fairness among DL jobs


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://paper.lingyunyang.com/reading-notes/conference/osdi-2021/pollux.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
