# Orca: A distributed serving system for transformer-based generative models

## Meta Info

Presented in [OSDI 2022](https://www.usenix.org/conference/osdi22/presentation/yu).

Authors: Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun (*Seoul National University, FriendliAI*)

## Understanding the paper

### TL;DRs

This paper presents **Orca**, a distributed serving system that applies *iteration-level scheduling* (scheduling execution one model iteration at a time, instead of one whole request at a time) and *selective batching* to Transformer-based generative models.

### Background

* ML inference serving: serving system + execution engine
  * Example: Triton (groups multiple client requests into a batch) + FasterTransformer (runs the inference procedure in a batched manner)

### Key observations

*Transformer-based generative models* generate the next token *in an autoregressive manner*, so the model must be executed *multiple times* (once per output token) to serve a single inference request.
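
To make the observation concrete, here is a minimal Python sketch of an autoregressive decode loop; `model_step`, `EOS`, and the toy arithmetic are placeholders of mine, not anything from the paper. Each forward pass yields exactly one token, so serving one request means running the model once per generated token, and different requests finish after different numbers of iterations.

```python
# Toy autoregressive decode loop. `model_step` is a hypothetical stand-in
# for one full Transformer forward pass; a real engine would return logits.
EOS = 0  # assumed end-of-sequence token id

def model_step(tokens: list[int]) -> int:
    """One forward pass over the sequence so far; returns the next token id."""
    return (sum(tokens) * 31 + len(tokens)) % 50  # dummy arithmetic, not a model

def generate(prompt: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):  # one model execution per output token
        nxt = model_step(tokens)
        tokens.append(nxt)
        if nxt == EOS:               # requests finish after different numbers
            break                    # of iterations -- the key observation
    return tokens

print(generate([7, 3], max_new_tokens=5))
```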

### Design

* Key designs
  * Schedule the execution at the granularity of an *iteration* instead of a *request* (see the scheduling sketch after this list)
  * Selective batching
    * Split the batch and process each request individually for *the Attention operation*
      * Not batching the Attention operation has *only a small impact on efficiency*, since Attention involves no model parameters
    * Apply batching to *all other operations*
* Others
  * Use a simple first-come-first-served (FCFS) scheduling algorithm
  * Adopt intra-layer and inter-layer model parallelism
  * Reserve "max tokens" slots of GPU memory in advance for storing the Attention keys and values
  * Tune the maximum batch size to maximize throughput while satisfying one's latency budget
  * Separate the communication channels for control messages (plus tokens) and tensor data transfer
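
As referenced above, here is a hedged Python sketch of iteration-level scheduling with FCFS admission; the `Request` structure, `MAX_LEN` cap, and toy `model_step` are my own simplifications, not Orca's engine interface. The point is that the scheduler invokes the engine for a single iteration at a time, so finished requests return immediately and newly arrived requests join the running batch without waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

EOS = 0        # assumed end-of-sequence token id
MAX_BATCH = 4  # maximum batch size, tuned against the latency budget
MAX_LEN = 10   # hypothetical per-request length cap to keep the demo finite

@dataclass
class Request:
    rid: int
    tokens: list = field(default_factory=list)

def model_step(tokens):
    """Stand-in for one Transformer iteration over a single request."""
    return (sum(tokens) * 31 + len(tokens)) % 7  # dummy logic

def serve(queue: deque):
    running = []
    while queue or running:
        # FCFS admission: fill free batch slots from the head of the queue
        # at *every* iteration, not only when the batch is empty.
        while queue and len(running) < MAX_BATCH:
            running.append(queue.popleft())
        # Run exactly ONE iteration for the current batch: one token each.
        for req in running:
            req.tokens.append(model_step(req.tokens))
        # Retire finished requests immediately instead of padding them along
        # until the slowest request in the batch completes.
        still_running = []
        for req in running:
            if req.tokens[-1] == EOS or len(req.tokens) >= MAX_LEN:
                print(f"request {req.rid} done after {len(req.tokens)} tokens")
            else:
                still_running.append(req)
        running = still_running

serve(deque(Request(rid=i, tokens=[i + 1]) for i in range(6)))
```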

### Implementation

* 13k lines of C++, based on CUDA.
* Used gRPC for communication in the control plane.
* Used NCCL in the data plane.
* Implemented fused kernels for LayerNorm, Attention, and GeLU operators.
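
The fused LayerNorm/Attention/GeLU kernels above line up with the selective batching split from the design section. Below is an illustrative numpy sketch of that split; the shapes, the toy `attention`, and the flatten/split helpers are my assumptions, not Orca's kernels. Token-wise operations such as Linear/GeLU run once over all tokens of all requests flattened into a single matrix, while Attention runs per request, since each request carries its own sequence length and key/value state.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                 # hidden size
seq_lens = [3, 5, 2]                  # three requests with different lengths
hidden = [rng.standard_normal((n, H)) for n in seq_lens]

# Batched part: token-wise ops see one flat [sum(seq_lens), H] tensor,
# so requests of different lengths batch together without padding.
x = np.concatenate(hidden, axis=0)    # shape (10, 8)
W = rng.standard_normal((H, H))
x = np.maximum(x @ W, 0.0)            # e.g. Linear + activation, batched

# Non-batched part: split back into per-request tensors and run a toy
# single-head self-attention on each request individually.
def attention(q: np.ndarray) -> np.ndarray:
    scores = q @ q.T / np.sqrt(H)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ q

splits = np.split(x, np.cumsum(seq_lens)[:-1], axis=0)
outputs = [attention(part) for part in splits]  # one Attention call per request
print([out.shape for out in outputs])           # [(3, 8), (5, 8), (2, 8)]
```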

### Evaluation

* Models
  * GPT-3 models (13B, 101B, 175B, 341B)
  * No actual model checkpoints: parameters are randomly initialized, since performance does not depend on the parameter values
  * Synthesized traces of client requests
* **Orca** outperforms NVIDIA FasterTransformer: a 36.9x throughput improvement at the same level of latency (on the GPT-3 175B model).

