# Multi-resource interleaving for deep learning training

## Meta Info

Presented at [SIGCOMM 2022](https://doi.org/10.1145/3544216.3544224).

Authors: Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, Xin Jin.

Code: <https://github.com/Rivendile/Muri>

## Understanding the paper

### TL;DRs

This paper presents **Muri**, a DL *cluster scheduler* that exploits *multi-resource interleaving* to improve cluster utilization and job efficiency.\
It designs a scheduling algorithm based on the *Blossom algorithm* for multi-resource multi-job packing, as sketched below.
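
As a rough illustration (not Muri's code), grouping jobs for two-job interleaving can be cast as maximum-weight matching, the problem the Blossom algorithm solves. The sketch below assumes made-up per-stage durations and a toy interleaving score (neither is from the paper) and pairs jobs with `networkx.max_weight_matching`, which implements a Blossom-style algorithm:

```python
# A minimal sketch (not Muri's code): pair jobs for two-job interleaving
# via maximum-weight matching, solved by networkx's Blossom-based matcher.
import networkx as nx

# Hypothetical per-stage durations (seconds) of one training iteration:
# (data loading, preprocessing, GPU compute, gradient synchronization).
jobs = {
    "job_a": (0.2, 0.3, 1.0, 0.5),
    "job_b": (1.0, 0.4, 0.3, 0.2),
    "job_c": (0.3, 0.2, 0.9, 0.6),
    "job_d": (0.8, 0.5, 0.4, 0.3),
}

def interleave_score(s1, s2):
    """Toy proxy for interleaving efficiency: time saved by overlapping
    two jobs' stages versus running their iterations back to back."""
    serial = sum(s1) + sum(s2)
    # When interleaved, each resource is busy for both jobs' demand on it,
    # and the bottleneck resource bounds the combined iteration time.
    interleaved = max(a + b for a, b in zip(s1, s2))
    return serial - interleaved

g = nx.Graph()
names = list(jobs)
for i, u in enumerate(names):
    for v in names[i + 1:]:
        g.add_edge(u, v, weight=interleave_score(jobs[u], jobs[v]))

# Maximum-weight matching: each chosen pair shares resources in time.
pairs = nx.max_weight_matching(g, maxcardinality=True)
print(pairs)  # e.g. {("job_a", "job_b"), ("job_c", "job_d")}
```

Pairing two jobs is only the simplest case; the paper's multi-resource multi-job packing groups jobs across more resources, which this sketch does not attempt.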

### Key Insights

* DL training jobs have a unique staged, *iterative computation pattern*.
  * Each iteration is composed of a sequence of stages: *data loading (storage I/O)*, *preprocessing (CPUs)*, *forward and backward propagation (GPUs)*, and *gradient synchronization (network I/O)*.
  * Because each stage stresses a different resource, jobs can be time-shifted so that their stages overlap, which motivates fine-grained multi-resource interleaving in *time* (see the sketch after this list).
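
To make the staged pattern concrete, here is a skeletal single-process PyTorch iteration with each stage annotated; the model, dataset, and sizes are placeholders, not from the paper:

```python
# A minimal sketch of one training iteration, annotated with the four
# resource stages Muri interleaves; model and dataset are placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 32), torch.randint(0, 2, (256,)))
# Stages 1-2: data loading (storage I/O) and preprocessing (CPU); real
# jobs set num_workers>0 so these run in worker processes off the GPU path.
loader = DataLoader(dataset, batch_size=64)

model = nn.Linear(32, 2)  # stage 3 would run on GPUs in a real job
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for x, y in loader:
    out = model(x)       # stage 3: forward propagation
    loss = loss_fn(out, y)
    opt.zero_grad()
    loss.backward()      # stage 3: backward propagation
    # Stage 4: in distributed training, DistributedDataParallel would
    # all-reduce gradients here (network I/O); omitted in this
    # single-process sketch.
    opt.step()
```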

### Implementation

Built a prototype of Muri (\~7k LoC) and integrated it with **PyTorch**.
