
# Looking beyond GPUs for DNN scheduling on multi-tenant clusters

## Meta Info

Presented at [OSDI 2022](https://www.usenix.org/conference/osdi22/presentation/mohan).

Authors: Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni (*Microsoft Research*), Vijay Chidambaram (*UT-Austin & VMware Research*).

Code: <https://github.com/msr-fiddle/synergy>

## Understanding the paper

### TL;DR

This paper presents **Synergy**, a scheduler for DNN training jobs that accounts for each job's sensitivity to its CPU and memory allocations, rather than to GPUs alone.

### Limitations

It doesn't consider **fractional GPU allocations** (no GPU sharing).

### Scheduling algorithms

* It proposes two algorithms to enable **multi-dimensional bin-packing**.
  * **Synergy-Opt**
    * Find approximate solutions using an **ILP formulation** (typical Microsoft style...).
    * Computationally expensive.
  * **Synergy-Tune**
    * Sort the pending jobs by GPU demand, then by CPU demand, then by memory demand.
    * *Best-fit*. Among the servers that can fit the job's demand vector, pick the one with **the least amount of free resources**.
      * Code: <https://github.com/msr-fiddle/synergy/blob/master/simulator/resources/cluster.py#L382>
    * The GPU demand is fixed, but **the auxiliary resource allocations (CPU, memory) are fungible**.
    * Performs within 10% of the optimal value (**Synergy-Opt**).
* Both are compared against a naive greedy algorithm, **Synergy-Greedy**.
  * *First-fit*. Place the job on the first server that can satisfy the job's demands in all dimensions.
  * Problems
    * Results in GPU fragmentation: GPUs sit idle on servers whose auxiliary resources (CPU, memory) have been exhausted by other jobs.
    * Hurts fairness, as some jobs can be skipped over for a long time if their demands cannot be satisfied in the cluster.
  * (In my view, this is a poor baseline...)
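
The difference between the two placement rules above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the Synergy repository: the `Server` class, the `(gpus, cpus, mem_gb)` demand tuple, and all function names are hypothetical. Two choices are assumptions on my part: jobs are sorted largest-demand first, and "least free resources" is interpreted as a lexicographic comparison of free GPUs, then CPUs, then memory.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical demand vector: (gpus, cpus, mem_gb).
Demand = Tuple[int, float, float]


@dataclass
class Server:
    name: str
    gpus: int       # free GPUs
    cpus: float     # free CPU cores
    mem_gb: float   # free memory in GB

    def free(self) -> Demand:
        return (self.gpus, self.cpus, self.mem_gb)

    def fits(self, demand: Demand) -> bool:
        # A job fits only if every dimension of its demand is satisfied.
        return all(have >= need for have, need in zip(self.free(), demand))


def sort_jobs(jobs: List[Demand]) -> List[Demand]:
    # Assumption: largest demand first. Tuples compare by GPU,
    # then CPU, then memory, which matches the sort order in the notes.
    return sorted(jobs, reverse=True)


def first_fit(servers: List[Server], demand: Demand) -> Optional[Server]:
    # Greedy baseline: take the first server that fits in all dimensions.
    return next((s for s in servers if s.fits(demand)), None)


def best_fit(servers: List[Server], demand: Demand) -> Optional[Server]:
    # Among servers that fit, take the one with the least free resources
    # (assumed lexicographic order over GPU, CPU, memory).
    candidates = [s for s in servers if s.fits(demand)]
    return min(candidates, key=Server.free, default=None)


servers = [Server("a", 8, 24, 500.0), Server("b", 2, 6, 125.0)]
job = (2, 6, 125.0)
print(first_fit(servers, job).name)  # a
print(best_fit(servers, job).name)   # b
```

In this toy case, first-fit carves the 2-GPU job out of the empty 8-GPU server, while best-fit fills the nearly full server exactly and leaves the large server intact for bigger jobs.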

### Implementation

A prototype of Synergy and an associated event-driven simulator are implemented in *Python*.

### Evaluation

* Testbed
  * 4-node cluster, each node with 500 GB DRAM, 24 CPU cores, and 8 V100 GPUs
* Simulation
  * Consider two clusters:
    * 16-node cluster (same node configuration as above)
    * 64-node cluster (same node configuration as above)
* Assume a **CPU:GPU ratio of 3** (24 cores / 8 GPUs) and a **fair-share memory of 62.5 GB per GPU** (500 GB / 8 GPUs).
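
The fair-share numbers follow directly from the node configuration. As a quick check, here is a small sketch that computes the GPU-proportional CPU and memory share for a job; the function name and its parameters are hypothetical, with defaults taken from the node configuration listed above.

```python
from typing import Tuple


def proportional_share(
    job_gpus: int,
    gpus_per_node: int = 8,
    cpus_per_node: int = 24,
    mem_gb_per_node: float = 500.0,
) -> Tuple[float, float]:
    """Return the (cpus, mem_gb) share proportional to the job's GPU count."""
    cpus = job_gpus * cpus_per_node / gpus_per_node
    mem_gb = job_gpus * mem_gb_per_node / gpus_per_node
    return cpus, mem_gb


print(proportional_share(1))  # (3.0, 62.5)
print(proportional_share(4))  # (12.0, 250.0)
```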

