Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads

Live GPU job migration.

Metadata

Authors: Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie Ailijiang, Suresh Krishnappa, Mark Russinovich (Microsoft)

Understanding the paper

TL;DR

This paper proposes a transparent and robust mechanism to checkpoint generic DNN training jobs that are written without checkpointing support, thus making all (even unmodified) jobs automatically checkpointable, preemptible, migratable, and elastic.

Benefits to Users

Improved fault tolerance.
Opportunistic usage of capacity.
- Enable BE jobs to opportunistically use spare capacity, and be quickly preempted (without lost work) when Prod jobs arrive.
Background defragmentation for locality.
- Migration of small jobs enables the scheduler to defragment locality domains to place larger jobs.
Online upgrades.
- Cluster-wide.

Previous work

Propose migration and elastic resource sharing as an enabler for better scheduling (e.g., Gandiva, GandivaFair).
Don't address the problem of how to deal with the vast majority of jobs that are not migratable or elastic.

A remarkably simple user experience

The user focuses only on the ML task and does not need to think about checkpointing or elasticity.

How to checkpoint?

Program state (CPU)
Device state (GPU)
Communication state (NCCL)
File system state

CRIU

Homepage: https://criu.org/Main_Page
Provide address-space migration to checkpoint program state.
Key limitation: cannot handle device mappings by processes using GPU.
Related Issues
- criu dump failed for /dev/fb0: https://github.com/checkpoint-restore/criu/issues/527
- CRIU Cuda support: https://github.com/checkpoint-restore/criu/issues/534

Device Proxy

Intercept the CUDA API via the LD_PRELOAD mechanism.

Server component: one per device. Client component: embedded in each process interacting with the device.

Two types of interceptors:

Dispatch Interceptors ( $\text{D}_{\text{Int}}$ )
- Ship the API cross-address-space to the device proxy server.
- Handle serialization/deserialization of parameters/response.
Semantics-Aware Interceptors ( $\text{SA}_{\text{Int}}$ )
- Memory allocation.
- Communication. (Barrier)
- Device Synchronization.

Checkpoint device state

Model state (e.g., parameters) is checkpointed by the device-proxy process via device-to-host memcpy.
All stateful API calls (e.g., creation of context, stream, event, etc.) are annotated, and the $\text{D}_{\text{Int}}$ for those calls automatically log them for replay upon restore.

Checkpoint communication state

Cannot handle in-flight communication.
Quiesce the job to ensure that there are no collective calls in flight (acquire the barrier).

Checkpoint file system state

The libc I/O libraries (e.g., open, read, write, etc.) are intercepted by a $\text{SA}_{\text{Int}}$ to track/log updates made by the job to the local file system.
Mutated files can be migrated along with the process checkpoint.

Checkpoint/Restore flow

Acquire the barrier.
Generate a GPU and CRIU dump.
Upload the dump to remote storage.
Download the dump on a different set of nodes.
Restore the CRIU dumps and GPU dumps, and release the barrier.

Transparent checkpointing

On-demand: when job needs to be preempted.
Periodically: specified by the user (epoch-level or time-based)

Transparent elasticity

Last updated 1 year ago

Was this helpful?