Looking beyond GPUs for DNN scheduling on multi-tenant clusters

#deep_learning_training_workloads #resource_scheduler #homogeneous_cluster

Meta Info

Presented at OSDI 2022.

Authors: Jayashree Mohan, Amar Phanishayee, Janardhan Kulkarni (Microsoft Research), Vijay Chidambaram (UT-Austin & VMware Research).

Code: https://github.com/msr-fiddle/synergy

Understanding the paper

TL;DR

This paper presents Synergy, a scheduler for DNN training jobs that accounts for each job's sensitivity to its CPU and memory allocation, rather than looking at GPUs alone.

Limitations

It doesn't consider fractional GPU allocations (no GPU sharing).

Scheduling algorithms

  • It proposes two algorithms to enable multi-dimensional bin-packing.

    • Synergy-Opt

      • Finds approximate solutions using an ILP formulation (typical Microsoft style...).

      • Computationally expensive.

    • Synergy-Tune

  • Both are compared against a naive greedy baseline, Synergy-Greedy.

    • First-fit. Place the job on the server that can satisfy the job's demands in all dimensions.

    • Problems

      • Results in GPU fragmentation: GPUs on a server are left idle once its auxiliary resources (CPU, memory) are exhausted by other jobs.

      • Hurts fairness, as some jobs can be skipped over for a long time if their demands cannot be satisfied in the cluster.

    • (In my view, this is a poor baseline...)
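To make the fragmentation problem concrete, here is a minimal sketch of a first-fit placement over three resource dimensions, as the notes describe Synergy-Greedy. It is not the paper's code: the `Server` and `Job` classes, the function names, and the example demands are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Server:
    """Free resources on one server (hypothetical model, not Synergy's)."""
    gpus: int
    cpus: int
    mem_gb: float


@dataclass
class Job:
    """Per-job demand in every dimension (hypothetical model)."""
    name: str
    gpus: int
    cpus: int
    mem_gb: float


def fits(server: Server, job: Job) -> bool:
    # The job fits only if every dimension can be satisfied at once.
    return (server.gpus >= job.gpus
            and server.cpus >= job.cpus
            and server.mem_gb >= job.mem_gb)


def first_fit(servers: List[Server], job: Job) -> Optional[int]:
    # Take the first server that satisfies the demand and deduct resources.
    for idx, server in enumerate(servers):
        if fits(server, job):
            server.gpus -= job.gpus
            server.cpus -= job.cpus
            server.mem_gb -= job.mem_gb
            return idx
    return None


def greedy_schedule(servers: List[Server],
                    queue: List[Job]) -> Tuple[Dict[str, int], List[str]]:
    # Walk the queue in order; jobs that fit nowhere are skipped this round.
    placement: Dict[str, int] = {}
    skipped: List[str] = []
    for job in queue:
        idx = first_fit(servers, job)
        if idx is None:
            skipped.append(job.name)
        else:
            placement[job.name] = idx
    return placement, skipped


# Assumed example: one CPU-hungry job takes all 24 cores of a node.
servers = [Server(gpus=8, cpus=24, mem_gb=500.0)]
queue = [Job("cpu_hungry", gpus=1, cpus=24, mem_gb=100.0),
         Job("small", gpus=1, cpus=3, mem_gb=62.5)]
placement, skipped = greedy_schedule(servers, queue)
```

In this assumed example the first job consumes every CPU core, so the second job is skipped even though seven GPUs remain free on the server. That is the fragmentation and skip-over behaviour listed under "Problems".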

Implementation

A prototype of Synergy and an associated event-driven simulator are implemented in Python.

Evaluation

  • Testbed

    • 4-node cluster; each node has 500GB DRAM, 24 CPU cores, and 8 V100 GPUs

  • Simulation

    • Consider two clusters:

      • 16-node cluster (same node configuration as above)

      • 64-node cluster (same node configuration as above)

  • Assume a CPU:GPU ratio of 3 and a fair-share memory of 62.5GB per GPU (i.e., 24 cores / 8 GPUs and 500GB / 8 GPUs on the nodes above).
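The fair-share figures follow directly from the node configuration. As a quick check, the sketch below computes a GPU-proportional share for a job; the function name and signature are hypothetical and are not taken from the Synergy codebase.

```python
from typing import Tuple


def gpu_proportional_share(node_cpus: int, node_mem_gb: float,
                           node_gpus: int, job_gpus: int) -> Tuple[float, float]:
    """Split a node's CPU cores and memory evenly per GPU, then scale
    by the number of GPUs the job holds (hypothetical helper)."""
    cpus_per_gpu = node_cpus / node_gpus
    mem_per_gpu = node_mem_gb / node_gpus
    return cpus_per_gpu * job_gpus, mem_per_gpu * job_gpus


# Node configuration from the evaluation: 24 cores, 500GB DRAM, 8 GPUs.
one_gpu = gpu_proportional_share(24, 500.0, 8, job_gpus=1)   # (3.0, 62.5)
four_gpu = gpu_proportional_share(24, 500.0, 8, job_gpus=4)  # (12.0, 250.0)
```

A 1-GPU job therefore gets 3 cores and 62.5GB under this proportional split, which matches the ratio assumed in the evaluation.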
