Multi-resource interleaving for deep learning training

#deep_learning_training_workloads #multi_resource_scheduler #multi_resource_interleaving #PyTorch #iterative_process #blossom_algorithm

Meta Info

Presented at SIGCOMM 2022.

Authors: Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, Xin Jin.

Code: https://github.com/Rivendile/Muri

Understanding the paper

TL;DRs

This paper presents Muri, a DL cluster scheduler that exploits fine-grained multi-resource interleaving to improve cluster and job efficiency. Its scheduling algorithm for multi-resource, multi-job packing is based on the Blossom maximum-weight-matching algorithm; a rough sketch follows.
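As a minimal sketch of the packing step, under assumptions not in the paper (groups of exactly two jobs, a made-up interleaving score computed from per-stage times over four resource types, and networkx's Blossom-based max_weight_matching as the solver):

```python
import networkx as nx

STAGES = ("io", "cpu", "gpu", "net")

def interleaving_score(job_a, job_b):
    # Packed iteration time is bounded below by the busiest resource,
    # since both jobs' work on that resource must serialize.
    packed = max(job_a[s] + job_b[s] for s in STAGES)
    # Time-sharing the two jobs instead costs the sum of all stage times.
    sequential = sum(job_a.values()) + sum(job_b.values())
    return sequential / packed  # >= 1; higher means better overlap

def pack_jobs(jobs):
    # Complete graph over jobs, edges weighted by interleaving score;
    # maximum-weight matching (Blossom algorithm) picks the pairing.
    g = nx.Graph()
    names = list(jobs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            g.add_edge(a, b, weight=interleaving_score(jobs[a], jobs[b]))
    return nx.max_weight_matching(g)

# Hypothetical per-iteration stage times (arbitrary units) for four jobs.
jobs = {
    "resnet": {"io": 2, "cpu": 3, "gpu": 8, "net": 1},  # GPU-heavy
    "bert":   {"io": 1, "cpu": 1, "gpu": 6, "net": 5},  # GPU/network-heavy
    "dlrm":   {"io": 6, "cpu": 5, "gpu": 2, "net": 3},  # I/O- and CPU-heavy
    "gnn":    {"io": 3, "cpu": 6, "gpu": 1, "net": 2},  # CPU-heavy
}
print(pack_jobs(jobs))  # e.g. {('resnet', 'dlrm'), ('bert', 'gnn')}
```

The matching pairs jobs with complementary bottlenecks (GPU-heavy with I/O- and CPU-heavy), which is exactly the kind of grouping that interleaving rewards; the paper's scheduler generalizes this matching idea beyond two-job groups.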

Key Insights

  • DL training jobs have a unique staged, iterative computation pattern.

    • This motivates fine-grained multi-resource interleaving in time.

    • Each iteration is composed of a sequence of stages such as data loading (storage I/O), preprocessing (CPU), forward and backward propagation (GPU), and gradient synchronization (network I/O); the toy schedule below makes this concrete.
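A toy schedule of the staged pattern, assuming equal-length stages (an idealization; real stage times vary per job): two jobs shifted in phase occupy different resources at every time slot, so neither waits on the other.

```python
STAGES = ["storage I/O", "CPU", "GPU", "network I/O"]

def stage_at(step, offset):
    # Each job cycles through the four stages, one per time slot.
    return STAGES[(step + offset) % len(STAGES)]

for step in range(8):
    a = stage_at(step, 0)  # job A starts at data loading
    b = stage_at(step, 2)  # job B runs two stages out of phase
    print(f"t={step}: job A on {a:11} | job B on {b:11}")
```

With a single-resource (GPU-only) scheduler, the same two jobs would time-share the GPU while storage, CPU, and network sat idle for most of each iteration.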

Implementation

The authors built a prototype of Muri in ~7k lines of code and integrated it with PyTorch.
