Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences
#deep_learning_inference_system #GPU_kernel_preemption #co-location
Meta Info
Presented at OSDI 2022.
Authors: Mingcong Han, Hanze Zhang, Rong Chen, Haibo Chen (SJTU).
Code: https://github.com/SJTU-IPADS/reef
DNN Inference Serving Benchmark: https://github.com/SJTU-IPADS/disb
Artifact: https://github.com/SJTU-IPADS/reef-artifacts/tree/osdi22-ae
Understanding the paper
TL;DR
This paper presents a GPU-accelerated DNN inference serving system named REEF, which enables microsecond-scale kernel preemption and controlled concurrent execution in GPU scheduling.
Technical details
Two execution modes
Normal mode
when the real-time task queue is empty
best-effort tasks are dispatched to their associated GPU streams and run concurrently
Real-time mode
when real-time tasks arrive; best-effort kernels are preempted so the real-time task can run immediately
Dynamic kernel padding (DKP)
combines appropriate best-effort kernels with the real-time kernel into a single kernel
launches the padded kernel with a single GPU stream
Two rules for selecting best-effort candidates (see the sketch after this list)
Execution time: best-effort kernels < real-time kernel, so padding never delays the real-time kernel
CU occupancy: best-effort kernels > real-time kernel
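The selection step can be illustrated with a small host-side sketch (not REEF's actual code; `KernelInfo`, its fields, and `select_padding_candidates` are assumptions for illustration): a best-effort kernel qualifies for padding only if it would finish before the real-time kernel and occupies more CUs than it.

```cpp
// Minimal sketch of DKP candidate selection, assuming each kernel carries a
// profiled execution-time estimate and a CU-occupancy estimate.
#include <cstdint>
#include <cstdio>
#include <vector>

struct KernelInfo {
    const char* name;        // identifier for the sketch only
    uint64_t exec_time_us;   // profiled execution-time estimate
    uint32_t cu_occupancy;   // estimated number of compute units occupied
};

// Pick best-effort kernels to pad alongside a real-time kernel.
// Rule 1: execution time shorter than the real-time kernel's,
//         so padding never delays the real-time kernel.
// Rule 2: CU occupancy higher than the real-time kernel's.
std::vector<KernelInfo> select_padding_candidates(
        const KernelInfo& rt,
        const std::vector<KernelInfo>& best_effort_pool,
        size_t max_padded) {
    std::vector<KernelInfo> padded;
    for (const KernelInfo& be : best_effort_pool) {
        if (padded.size() >= max_padded) break;
        if (be.exec_time_us >= rt.exec_time_us) continue;  // rule 1
        if (be.cu_occupancy <= rt.cu_occupancy) continue;  // rule 2
        padded.push_back(be);
    }
    return padded;  // these would be fused with rt into a single kernel launch
}

int main() {
    KernelInfo rt{"rt_conv", 200, 20};
    std::vector<KernelInfo> pool{{"be_gemm", 150, 40}, {"be_relu", 30, 10},
                                 {"be_pool", 300, 50}};
    for (const auto& k : select_padding_candidates(rt, pool, 4))
        std::printf("pad %s\n", k.name);  // prints only be_gemm
}
```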
Reset-based preemption evicts best-effort kernels at three levels:
For host queues (HQs), dequeue all buffered kernels and reclaim their memory.
For device queues (DQs), inject a short preamble at the beginning of each best-effort kernel in advance so that it terminates itself when the preemption flag is set (see the sketch after this list).
For compute units (CUs), retrofit the kernel-killing function of the GPU driver on AMD GPUs to stop kernels that are already running.
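A CUDA-flavored sketch of the injected preamble idea follows (REEF implements this on AMD GPUs; the kernel body and the name `preempt_flag` are assumptions for illustration):

```cuda
#include <cuda_runtime.h>

__device__ int preempt_flag = 0;  // raised by the scheduler to request preemption

__global__ void best_effort_kernel(const float* in, float* out, int n) {
    // Injected preamble: kernels still buffered in the device queue start,
    // observe the flag, and return immediately, so they drain in microseconds
    // instead of running to completion.
    if (preempt_flag) return;

    int i = blockIdx.x * blockDim.x + threadIdx.x;  // original kernel body
    if (i < n) out[i] = in[i] * 2.0f;
}

// Host side: setting the flag neutralizes every not-yet-started best-effort kernel.
void request_preemption() {
    int one = 1;
    cudaMemcpyToSymbol(preempt_flag, &one, sizeof(int));
}
```

Because this check only runs when a kernel starts, it clears kernels buffered in the DQs but cannot stop kernels already executing on the CUs, which is why the CU level additionally relies on the driver's kernel-killing path.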
For closed-source NVIDIA GPUs, the paper proposes a restricted version of reset-based preemption named REEF-N.
The GPU runtime has to be treated as a black box, so REEF-N cannot proactively kill running kernels.
Instead, it intercepts three CUDA APIs related to kernel launch and stream management and maintains virtual host queues (vHQs).
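A minimal sketch of the interception idea (not REEF-N's actual implementation): the kernel-launch API is interposed in LD_PRELOAD style, assuming the application links libcudart dynamically; only the launch path is shown, and `LaunchRecord`, `vhq`, and `drain_vhq` are hypothetical stand-ins for REEF-N's vHQ machinery.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for RTLD_NEXT
#endif
#include <dlfcn.h>
#include <cuda_runtime.h>
#include <deque>
#include <mutex>

struct LaunchRecord {           // one buffered best-effort launch
    const void* func;
    dim3 grid, block;
    void** args;                // a real interceptor would deep-copy the arguments
    size_t shared_mem;
    cudaStream_t stream;
};

static std::deque<LaunchRecord> vhq;  // virtual host queue (one per client in REEF-N)
static std::mutex vhq_mu;

using launch_fn = cudaError_t (*)(const void*, dim3, dim3, void**, size_t, cudaStream_t);

// Interposed cudaLaunchKernel: buffer the launch instead of forwarding it, so
// the scheduler decides when (or whether) the kernel reaches the real runtime.
extern "C" cudaError_t cudaLaunchKernel(const void* func, dim3 grid, dim3 block,
                                        void** args, size_t shared_mem,
                                        cudaStream_t stream) {
    std::lock_guard<std::mutex> lock(vhq_mu);
    vhq.push_back({func, grid, block, args, shared_mem, stream});
    return cudaSuccess;          // caller sees success; errors surface when draining
}

// Called by the scheduler when best-effort work may proceed: forward buffered
// launches to the real runtime function found via RTLD_NEXT.
void drain_vhq() {
    static launch_fn real_launch = (launch_fn)dlsym(RTLD_NEXT, "cudaLaunchKernel");
    std::lock_guard<std::mutex> lock(vhq_mu);
    while (!vhq.empty()) {
        LaunchRecord r = vhq.front();
        vhq.pop_front();
        real_launch(r.func, r.grid, r.block, r.args, r.shared_mem, r.stream);
    }
}
```

Since the interposed call returns before anything reaches the GPU, the scheduler fully controls when buffered best-effort kernels are submitted, which is how REEF-N approximates preemption without killing running kernels.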