Multi-resource interleaving for deep learning training
#deep_learning_training_workloads #multi_resource_scheduler #multi_resource_interleaving #PyTorch #iterative_process #blossom_algorithm
Meta Info
Presented at SIGCOMM 2022.
Authors: Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, Xin Jin.
Code: https://github.com/Rivendile/Muri
Understanding the paper
TL;DRs
This paper presents Muri, a DL cluster scheduler that exploits multi-resource interleaving to improve cluster and job efficiency. It designs a scheduling algorithm based on the Blossom (maximum-weight matching) algorithm for multi-resource, multi-job packing.
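A minimal sketch of the packing idea, assuming the Blossom algorithm is applied as maximum-weight matching over job pairs. The job names and pairwise "interleaving efficiency" scores below are hypothetical, and `networkx` stands in for the scheduler's own solver; this is not Muri's actual implementation.

```python
import networkx as nx

jobs = ["job_a", "job_b", "job_c", "job_d"]

# Hypothetical pairwise scores: how well two jobs' storage/CPU/GPU/network
# stages interleave when packed onto the same resources.
score = {
    ("job_a", "job_b"): 0.9,
    ("job_c", "job_d"): 0.8,
    ("job_a", "job_d"): 0.6,
    ("job_b", "job_c"): 0.5,
    ("job_a", "job_c"): 0.4,
    ("job_b", "job_d"): 0.3,
}

G = nx.Graph()
G.add_nodes_from(jobs)
for (u, v), w in score.items():
    G.add_edge(u, v, weight=w)

# networkx's max_weight_matching implements a Blossom-based algorithm;
# it returns the pairing with the highest total score, here
# (job_a, job_b) and (job_c, job_d).
matching = nx.max_weight_matching(G, maxcardinality=True)
print(matching)
```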
Key Insights
DL training jobs have a unique staged, iterative computation pattern.
This staged pattern motivates fine-grained interleaving of multiple resource types in time.
Each iteration is composed of a sequence of stages: data loading (Storage IO), preprocessing (CPUs), forward and backward propagation (GPUs), and gradient synchronization (Network IO).
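For concreteness, a minimal single-node PyTorch loop with the four stages marked. This is only an illustration of the pattern (synthetic data, no distributed setup), not Muri's code.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Stage 1 (Storage IO) and stage 2 (CPU preprocessing) run inside the
# DataLoader workers; a synthetic in-memory dataset stands in for them here.
dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2)

for x, y in loader:
    x, y = x.to(device), y.to(device)   # host-to-device copy
    loss = loss_fn(model(x), y)         # stage 3: forward propagation (GPU)
    optimizer.zero_grad()
    loss.backward()                     # stage 3: backward propagation (GPU)
    # Stage 4 (Network IO): with DistributedDataParallel, gradient
    # all-reduce would run here, overlapping with backward().
    optimizer.step()
```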
Implementation
Built a prototype of Muri with ~7k LoC and integrated it with PyTorch.