# Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences

## Meta Info

Presented in [OSDI 2022](https://www.usenix.org/conference/osdi22/presentation/han).

Authors: Mingcong Han, Hanze Zhang, Rong Chen, Haibo Chen (*SJTU*).

Code: <https://github.com/SJTU-IPADS/reef>

DNN Inference Serving Benchmark: <https://github.com/SJTU-IPADS/disb>

Artifact: <https://github.com/SJTU-IPADS/reef-artifacts/tree/osdi22-ae>

## Understanding the paper

### TL;DR

This paper presents a GPU-accelerated DNN inference serving system named **REEF**, which enables microsecond-scale **kernel preemption** and controlled **concurrent execution** in GPU scheduling.

### Technical details

* Two execution modes
  * Normal mode
    * used when the real-time task queue is empty
    * assigns best-effort tasks to their associated GPU streams
  * Real-time mode
    * entered when real-time tasks arrive
    * Dynamic kernel padding (DKP)
      * pads appropriate best-effort kernels together with the real-time kernel into a single kernel
      * launches the combined kernel on a single GPU stream
      * Two rules for selecting best-effort candidates
        * Execution time: best-effort kernels < real-time kernel (so padding does not delay the real-time kernel's completion)
        * CU occupancy: best-effort kernels > real-time kernel
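The two DKP selection rules above can be sketched as a simple candidate filter. This is a minimal Python sketch: the `Kernel` descriptor and `select_padding_candidates` are illustrative names, not REEF's actual data structures, and the timing/occupancy fields stand in for values REEF obtains by profiling.

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    exec_time_us: float   # estimated execution time (e.g., profiled offline)
    cu_occupancy: int     # how many compute units the kernel occupies

def select_padding_candidates(rt_kernel: Kernel, be_kernels: list[Kernel]) -> list[Kernel]:
    """Filter best-effort kernels eligible to be padded with the
    real-time kernel under the two DKP rules:
      1. a candidate's execution time must be shorter than the
         real-time kernel's, so padding never delays its completion;
      2. a candidate's CU occupancy must exceed the real-time
         kernel's, so the padded kernels use CUs effectively."""
    return [
        k for k in be_kernels
        if k.exec_time_us < rt_kernel.exec_time_us
        and k.cu_occupancy > rt_kernel.cu_occupancy
    ]
```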

<figure><img src="/files/9gciOwQ6KByk33V9tRgx" alt=""><figcaption><p>Extended GPU runtime for instant preemption.</p></figcaption></figure>

* For host queues (HQs), dequeue all buffered kernels and reclaim their memory.
* For device queues (DQs), inject a piece of code at the beginning of each kernel in advance; the kernel terminates itself when the preemption flag is set.
* For compute units (CUs), retrofit the kernel-killing function of the GPU driver on AMD GPUs.
* For *closed-source NVIDIA GPUs*, propose a restricted version of reset-based preemption named **REEF-N**.
  * The GPU runtime has to be treated as a black box, so REEF-N cannot proactively kill running kernels.
  * It intercepts three CUDA APIs related to kernel launch and stream management, and maintains virtual host queues (vHQs).
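The two cheaper preemption levels (draining the HQ, plus the injected flag check that already-dispatched kernels poll) can be simulated host-side. This is a hedged Python sketch with a hypothetical `PreemptibleStream` class; real REEF implements these steps inside the extended GPU runtime and in each kernel's prologue on the device.

```python
import threading
from collections import deque

class PreemptibleStream:
    """Host-side simulation of REEF's two lightweight preemption levels:
    (1) dequeue all kernels still buffered in the host queue, and
    (2) set a flag that injected code at each kernel's start polls,
    making dispatched kernels terminate themselves."""
    def __init__(self):
        self.host_queue = deque()          # buffered, not-yet-executed kernels
        self.preempt_flag = threading.Event()

    def launch(self, kernel):
        self.host_queue.append(kernel)

    def preempt(self):
        # Level 1: evict every kernel still buffered in the HQ.
        evicted = list(self.host_queue)
        self.host_queue.clear()
        # Level 2: raise the flag that the injected prologue code checks.
        self.preempt_flag.set()
        return evicted

    def run_one(self):
        if not self.host_queue:
            return None
        kernel = self.host_queue.popleft()
        # Injected check at the beginning of each kernel: terminate
        # immediately if the preemption flag has been set.
        if self.preempt_flag.is_set():
            return None
        return kernel()
```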
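REEF-N's interception idea can likewise be sketched: because the NVIDIA runtime is a black box, intercepted kernel launches are buffered in per-stream vHQs rather than submitted to the driver immediately, so the scheduler can still evict kernels it has not yet handed over. The class below is an assumption-laden illustration; the three intercepted CUDA APIs are abstracted into a single `submit_to_driver` callback.

```python
from collections import deque

class VirtualHostQueue:
    """Sketch of a REEF-N vHQ: intercepted launches are deferred here
    instead of going straight to the closed-source CUDA runtime, so
    not-yet-submitted kernels remain evictable on preemption."""
    def __init__(self, submit_to_driver):
        self.queue = deque()
        # Stand-in for the real (intercepted) kernel-launch path.
        self.submit_to_driver = submit_to_driver

    def intercept_launch(self, kernel):
        self.queue.append(kernel)  # defer the actual driver call

    def flush_one(self):
        # In normal operation, drain kernels to the driver gradually so
        # only a bounded number are beyond the scheduler's control.
        if self.queue:
            self.submit_to_driver(self.queue.popleft())

    def preempt(self):
        evicted = list(self.queue)
        self.queue.clear()
        return evicted
```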

