Singularity: Planet-scale, preemptive and elastic scheduling of AI workloads
Live GPU job migration.
Metadata
Presented in arXiv:2202.07848.
Authors: Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie Ailijiang, Suresh Krishnappa, Mark Russinovich (Microsoft)
Understanding the paper
TL;DR
This paper proposes a transparent and robust mechanism to checkpoint generic DNN training jobs that are written without checkpointing support, thus making all (even unmodified) jobs automatically checkpointable, preemptible, migratable, and elastic.
Benefits to Users
Improved fault tolerance.
Opportunistic usage of capacity.
Enable BE jobs to opportunistically use spare capacity, and be quickly preempted (without lost work) when Prod jobs arrive.
Background defragmentation for locality.
Migration of small jobs enables the scheduler to defragment locality domains to place larger jobs.
Online upgrades.
Cluster-wide.
Previous work
Propose migration and elastic resource sharing as an enabler for better scheduling (e.g., Gandiva, GandivaFair).
Don't address the problem of how to deal with the vast majority of jobs that are not migratable or elastic.
A remarkably simple user experience
The user focuses only on the ML task and does not need to think about checkpointing or elasticity.
How to checkpoint?
Program state (CPU)
Device state (GPU)
Communication state (NCCL)
File system state
CRIU
Homepage: https://criu.org/Main_Page
Provide address-space migration to checkpoint program state.
Key limitation: cannot handle device mappings by processes using GPU.
Related Issues
criu dump failed for /dev/fb0: https://github.com/checkpoint-restore/criu/issues/527
CRIU Cuda support: https://github.com/checkpoint-restore/criu/issues/534
Device Proxy
Intercept the CUDA API via the LD_PRELOAD
mechanism.
Server component: one per device. Client component: embedded in each process interacting with the device.
Two types of interceptors:
Dispatch Interceptors ()
Ship the API cross-address-space to the device proxy server.
Handle serialization/deserialization of parameters/response.
Semantics-Aware Interceptors ()
Memory allocation.
Communication. (Barrier)
Device Synchronization.
Checkpoint device state
Model state (e.g., parameters) is checkpointed by the device-proxy process via device-to-host memcpy.
All stateful API calls (e.g., creation of context, stream, event, etc.) are annotated, and the for those calls automatically log them for replay upon restore.
Checkpoint communication state
Cannot handle in-flight communication.
Quiesce the job to ensure that there are no collective calls in flight (acquire the barrier).
Checkpoint file system state
The libc I/O libraries (e.g., open, read, write, etc.) are intercepted by a to track/log updates made by the job to the local file system.
Mutated files can be migrated along with the process checkpoint.
Checkpoint/Restore flow
Acquire the barrier.
Generate a GPU and CRIU dump.
Upload the dump to remote storage.
Download the dump on a different set of nodes.
Restore the CRIU dumps and GPU dumps, and release the barrier.
Transparent checkpointing
On-demand: when job needs to be preempted.
Periodically: specified by the user (epoch-level or time-based)
Transparent elasticity
Last updated