This paper presents Muri, a DL cluster scheduler that utilizes multi-resource interleaving to improve cluster and job efficiency.
It designs a scheduling algorithm based on the Blossom algorithm for multi-resource, multi-job packing.
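The grouping idea can be sketched as a maximum-weight matching over job pairs. Muri uses the Blossom algorithm to solve this at scale; the sketch below brute-forces all perfect matchings over four toy jobs instead. The job profiles and scoring function are illustrative assumptions, not Muri's actual implementation.

```python
# Hedged sketch: choose job pairs that interleave well via maximum-weight
# matching. With four toy jobs we enumerate matchings by brute force; Muri
# would use the Blossom algorithm here. All numbers are invented.

# Per-iteration time each job spends on each resource (illustrative).
jobs = {
    "A": {"io": 2, "cpu": 1, "gpu": 6, "net": 1},
    "B": {"io": 1, "cpu": 5, "gpu": 2, "net": 2},
    "C": {"io": 3, "cpu": 1, "gpu": 5, "net": 1},
    "D": {"io": 1, "cpu": 4, "gpu": 1, "net": 4},
}

def interleave_score(p, q):
    """Time saved per iteration pair by interleaving jobs p and q.

    Serial execution costs the sum of all stages; interleaved execution is
    paced by the busiest resource, which serves both jobs' stages.
    """
    serial = sum(p.values()) + sum(q.values())
    bottleneck = max(p[r] + q[r] for r in p)
    return serial - bottleneck

def perfect_matchings(names):
    """Enumerate all ways to split names into disjoint pairs."""
    if not names:
        yield []
        return
    a, rest = names[0], names[1:]
    for i, b in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for m in perfect_matchings(remaining):
            yield [(a, b)] + m

names = list(jobs)
best = max(
    perfect_matchings(names),
    key=lambda m: sum(interleave_score(jobs[a], jobs[b]) for a, b in m),
)
print(best)  # the pairing with the largest total time saved
```

The edge weight rewards complementary resource profiles (e.g. a GPU-heavy job paired with a CPU-heavy one), which is the intuition behind Muri's packing objective.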
Key Insights
DL training jobs have a unique staged, iterative computation pattern.
This motivates fine-grained multi-resource interleaving in time.
Each iteration consists of a sequence of stages: data loading (storage I/O), preprocessing (CPU), forward and backward propagation (GPU), and gradient synchronization (network I/O).
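A back-of-the-envelope sketch of why this staged pattern enables interleaving: when two jobs with complementary stage times are pipelined, each resource serves both jobs' stages while the other resources stay busy, so the steady-state iteration time is bounded by the busiest resource rather than the sum of all stages. The numbers below are invented for illustration.

```python
# Illustrative per-iteration stage times (ms) for two hypothetical jobs.
job1 = {"io": 2, "cpu": 1, "gpu": 6, "net": 1}   # GPU-heavy
job2 = {"io": 1, "cpu": 5, "gpu": 2, "net": 2}   # CPU-heavy

# Coarse-grained time-sharing: all stages run back to back.
serial = sum(job1.values()) + sum(job2.values())

# Fine-grained interleaving: the busiest resource sets the pace, since it
# must serve both jobs' stages each iteration.
bottleneck = max(job1[r] + job2[r] for r in job1)

print(serial, bottleneck)  # 20 8 -> up to 2.5x better completion time
```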
Implementation
Built a prototype of Muri with ~7k LoC and integrated it with PyTorch.