Looking beyond GPUs for DNN scheduling on multi-tenant clusters
#deep_learning_training_workloads #resource_scheduler #homogeneous_cluster
Meta Info
Presented at OSDI 2022.
Authors: Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni (Microsoft Research), Vijay Chidambaram (UT-Austin & VMware Research).
Code: https://github.com/msr-fiddle/synergy
Understanding the paper
TL;DR
This paper presents Synergy, a scheduler for DNN training jobs that accounts for each job's sensitivity to its CPU and memory allocations, not just its GPU demand.
Limitations
It doesn't consider fractional GPU allocations (no GPU sharing).
Scheduling algorithms
It proposes two algorithms to enable multi-dimensional bin-packing.
Synergy-Opt
Finds approximate solutions using an ILP formulation (typical Microsoft style...).
Computationally expensive.
Synergy-Tune
Sort the pending jobs by GPU demand, then by CPU demand, then by memory demand.
Pick the server with the least free resources that can still fit the job's demand vector (best fit).
The GPU demand is fixed, but the auxiliary resource allocations (CPU, memory) are fungible.
Within 10% of the optimal value (Synergy-Opt).
Compared against a naive greedy baseline, Synergy-Greedy.
First-fit: place the job on the first server that can satisfy the job's demands in all dimensions.
Problems
Results in GPU fragmentation: once jobs exhaust a server's auxiliary resources, its remaining GPUs cannot be used.
Hurts fairness, since a job can be skipped over for a long time if its demands cannot be satisfied in the cluster.
(In my view, this is a poor baseline...)
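The contrast between the two placement strategies can be sketched in a few lines of Python. This is a minimal illustration of the bullets above, not the paper's implementation. Several details are assumptions of the sketch: jobs are sorted in descending order, "least free resources" is compared lexicographically as (GPUs, CPUs, memory), and fungibility is modeled as a simple fallback to the GPU-proportional share (3 CPUs and 62.5 GB per GPU) when the preferred demand does not fit. All names (`Res`, `best_fit`, `first_fit`, `tune_like`, `greedy_like`) are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass
class Res:
    """A resource vector: GPUs, CPU cores, memory in GB."""
    gpus: int
    cpus: float
    mem_gb: float

    def covers(self, other):
        return (self.gpus >= other.gpus and self.cpus >= other.cpus
                and self.mem_gb >= other.mem_gb)

    def key(self):
        # Assumed ordering: compare GPUs first, then CPUs, then memory.
        return (self.gpus, self.cpus, self.mem_gb)

def best_fit(free, demand):
    """Index of the fitting server with the least free resources, or None."""
    fitting = [i for i, f in enumerate(free) if f.covers(demand)]
    return min(fitting, key=lambda i: free[i].key(), default=None)

def first_fit(free, demand):
    """Index of the first server that fits the demand, or None."""
    return next((i for i, f in enumerate(free) if f.covers(demand)), None)

def proportional(demand, cpus_per_gpu, mem_per_gpu):
    """Cap CPU and memory at the GPU-proportional share; GPUs stay fixed."""
    return Res(demand.gpus,
               min(demand.cpus, demand.gpus * cpus_per_gpu),
               min(demand.mem_gb, demand.gpus * mem_per_gpu))

def place(jobs, servers, pick, fungible, cpus_per_gpu=3.0, mem_per_gpu=62.5):
    """Place (name, demand) pairs in the given order; return {name: (server, grant)}."""
    free = [replace(s) for s in servers]  # copy so the caller's list is untouched
    placed = {}
    for name, demand in jobs:
        options = [demand]
        if fungible:
            # Assumed fallback: accept the GPU-proportional share instead.
            options.append(proportional(demand, cpus_per_gpu, mem_per_gpu))
        for want in options:
            i = pick(free, want)
            if i is not None:
                free[i] = Res(free[i].gpus - want.gpus,
                              free[i].cpus - want.cpus,
                              free[i].mem_gb - want.mem_gb)
                placed[name] = (i, want)
                break
    return placed

def tune_like(jobs, servers):
    """Sort by (GPU, CPU, memory) demand, then best-fit with fungible CPU/memory."""
    ordered = sorted(jobs, key=lambda j: j[1].key(), reverse=True)
    return place(ordered, servers, best_fit, fungible=True)

def greedy_like(jobs, servers):
    """First-fit in arrival order; demands must be met in every dimension."""
    return place(jobs, servers, first_fit, fungible=False)

# Free resources per server: one empty node and one half-occupied node.
servers = [Res(8, 24, 500), Res(4, 12, 250)]
jobs = [("y", Res(8, 24, 400)), ("x", Res(4, 16, 100))]
tuned = tune_like(jobs, servers)
greedy = greedy_like(jobs, servers)
```

In this toy cluster, the greedy variant places `y` and then skips `x`, leaving the second server's 4 GPUs idle because only 12 CPUs remain there. The tune-like variant places `x` on that server with a reduced CPU grant, which is the effect the fungibility bullet describes.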
Implementation
A prototype of Synergy and an associated event-driven simulator are implemented in Python.
Evaluation
Testbed
4-node cluster, each node with 500 GB DRAM, 24 CPU cores, and 8 V100 GPUs.
Simulation
Consider two clusters:
16-node cluster (same node configuration as above)
64-node cluster (same node configuration as above)
Assume a fair-share allocation of 3 CPU cores and 62.5 GB of memory per GPU.
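These fair-share numbers are consistent with dividing a node's 24 CPU cores and 500 GB of DRAM evenly across its 8 GPUs. The arithmetic can be checked with a small helper (`fair_share` is a hypothetical name used only for this illustration):

```python
def fair_share(node_cpus, node_mem_gb, node_gpus, job_gpus):
    """GPU-proportional CPU and memory share for a job using job_gpus GPUs."""
    cpus_per_gpu = node_cpus / node_gpus
    mem_per_gpu = node_mem_gb / node_gpus
    return job_gpus * cpus_per_gpu, job_gpus * mem_per_gpu

per_gpu = fair_share(24, 500, 8, job_gpus=1)   # (3.0, 62.5)
four_gpu = fair_share(24, 500, 8, job_gpus=4)  # (12.0, 250.0)
```

Under this assumption, a 4-GPU job would receive 12 CPU cores and 250 GB of memory as its GPU-proportional share.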