Multi-resource interleaving for deep learning training
#deep_learning_training_workloads #multi_resource_scheduler #multi_resource_interleaving #PyTorch #iterative_process #blossom_algorithm
Presented in SIGCOMM 2022.
Authors: Yihao Zhao, Yuanqiang Liu, Yanghua Peng, Yibo Zhu, Xuanzhe Liu, Xin Jin.
Code:
This paper presents Muri, a DL cluster scheduler that exploits multi-resource interleaving to improve cluster and job efficiency. It designs a scheduling algorithm based on the Blossom matching algorithm for multi-resource, multi-job packing.
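To make the packing idea concrete, here is a minimal sketch (not Muri's actual code) of how grouping two jobs per group can be cast as maximum-weight matching: jobs are graph nodes, edge weights score how well two jobs' resource profiles complement each other, and `networkx.max_weight_matching` (a Blossom-style algorithm) picks the job pairs. The job names, profiles, and scoring function below are hypothetical.

```python
import networkx as nx

# Hypothetical per-job resource profiles: fraction of an iteration spent on
# each resource type (storage IO, CPU, GPU, network IO).
jobs = {
    "job_a": {"io": 0.4, "cpu": 0.1, "gpu": 0.4, "net": 0.1},
    "job_b": {"io": 0.1, "cpu": 0.4, "gpu": 0.1, "net": 0.4},
    "job_c": {"io": 0.3, "cpu": 0.3, "gpu": 0.2, "net": 0.2},
    "job_d": {"io": 0.1, "cpu": 0.1, "gpu": 0.7, "net": 0.1},
}

def interleaving_score(p, q):
    """Score how well two profiles interleave: if the jobs take turns on
    each resource, the bottleneck resource (max summed demand) stays low."""
    bottleneck = max(p[r] + q[r] for r in p)
    return (sum(p.values()) + sum(q.values())) / bottleneck

G = nx.Graph()
for a in jobs:
    for b in jobs:
        if a < b:
            G.add_edge(a, b, weight=interleaving_score(jobs[a], jobs[b]))

# Blossom-based maximum-weight matching selects disjoint job pairs that
# maximize the total interleaving score across the cluster.
pairs = nx.max_weight_matching(G)
print(sorted(tuple(sorted(p)) for p in pairs))
```

With these profiles, job_a (IO/GPU-heavy) pairs naturally with job_b (CPU/network-heavy), since their stage demands barely overlap.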
DL training jobs have a unique staged, iterative computation pattern.
Each iteration is composed of a sequence of stages, such as data loading (storage IO), preprocessing (CPUs), forward and backward propagation (GPUs), and gradient synchronization (network IO).
This staged pattern motivates fine-grained interleaving of multiple resources in time: while one job occupies the GPUs, another can use the CPUs or the network. A sketch of the stages appears below.
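The following is a minimal single-device PyTorch sketch of that staged iteration pattern; the model, dataset, and hyperparameters are hypothetical stand-ins, annotated with the resource each stage stresses.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

model = nn.Linear(256, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Stages 1-2: data loading (storage IO) and preprocessing (CPU workers).
dataset = TensorDataset(torch.randn(1024, 256), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=2)

for inputs, labels in loader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)  # Stage 3: forward propagation (GPUs)
    loss.backward()                        # Stage 3: backward propagation (GPUs);
                                           # under DistributedDataParallel, gradient
                                           # buckets are all-reduced here (network IO)
    optimizer.step()
```

Because each stage stresses a different resource, two jobs whose iterations are offset in time can keep the storage, CPUs, GPUs, and network busy simultaneously.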
The authors built a prototype of Muri in ~7k lines of code and integrated it with PyTorch.