# Orca: A distributed serving system for transformer-based generative models

## Meta Info

Presented in [OSDI 2022](https://www.usenix.org/conference/osdi22/presentation/yu).

Authors: Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun (*Seoul National University, FriendliAI*)

## Understanding the paper

### TL;DRs

This paper presents **Orca**, a distributed serving system that applies *iteration-level scheduling* (scheduling execution at the granularity of a single model iteration rather than a whole request) and *selective batching* to Transformer-based generative models.

### Background

* ML inference serving is typically split into two layers: a serving system and an execution engine
  * Example: Triton (groups multiple client requests into a batch) + FasterTransformer (runs the inference procedure on each batch)
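A minimal toy sketch of the request-level batching baseline described above (all names and shapes here are invented for illustration, not Triton's or FasterTransformer's actual APIs): the engine runs a whole batch to completion, so a request that finishes early still waits for the longest request in its batch before its result is returned.

```python
# Sketch of request-level batching (the baseline Orca improves on).
# A request is a hypothetical (prompt, num_tokens_to_generate) pair.
def serve_batch(requests):
    iters = max(n for _, n in requests)   # the batch runs this many iterations
    finished_at = {}
    for it in range(1, iters + 1):
        for rid, (_, n) in enumerate(requests):
            if it == n:
                # The request's output is complete here, but under
                # request-level scheduling it is only returned when the
                # whole batch finishes.
                finished_at[rid] = iters
    return finished_at

# A 2-token request waits for the 8-token one:
print(serve_batch([("a", 2), ("b", 8)]))  # → {0: 8, 1: 8}
```

The short request's result is held until iteration 8, and no newly arrived request can join until the batch drains; this is the inefficiency Orca's iteration-level scheduling removes.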

### Key observations

*Transformer-based generative models* generate the next token *in an autoregressive manner*, so the model must be executed *multiple times* (once per output token) to process a single inference request.
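The observation above can be sketched with a toy decoding loop (the `step` "model" here is a placeholder, not a real Transformer): each output token requires another full model run, and a request finishes only when it emits an end-of-sequence token or hits its length limit.

```python
# Toy sketch of autoregressive decoding. Each call to step() consumes the
# sequence so far and emits one more token, so serving one request needs
# as many model runs as there are output tokens.
def step(tokens):
    # Hypothetical "model": next token is (sum of tokens) % 10.
    return sum(tokens) % 10

def generate(prompt, eos=0, max_new_tokens=8):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = step(tokens)
        tokens.append(nxt)
        if nxt == eos:  # a request finishes when EOS is produced
            break
    return tokens

print(generate([3, 4]))  # → [3, 4, 7, 4, 8, 6, 2, 4, 8, 6]
```

Because different requests need different numbers of iterations, a fixed request-level batch forces finished requests to idle; this is what motivates Orca's design.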

### Design

* Key designs
  * Schedule the execution at the granularity of *iteration* instead of *request*
  * Selective batching
    * Split the batch and process each request individually for *the attention operation*
      * The decision not to batch the executions of the Attention operation has *only a small impact on efficiency*
    * Apply batching to *other operations*
* Others
  * Simple first-come-first-served (FCFS) scheduling algorithm
  * Adopt intra-layer and inter-layer model parallelism
  * Reserve "max tokens" slots of GPU memory for storing the keys & values in advance
  * Tune the maximum batch size to maximize throughput while satisfying one’s latency budget
  * Separate the communication channels for control messages (plus tokens) and tensor data transfer
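The two key designs above can be sketched together in a toy scheduler (a minimal illustration under invented assumptions, not Orca's actual implementation; `run_iteration`, the random weights, and the per-request `target` length are all hypothetical):

```python
import numpy as np

# Sketch of Orca's two key ideas:
#  - iteration-level scheduling: an FCFS scheduler admits up to max_batch
#    requests and runs exactly ONE model iteration, so finished requests
#    leave and newly arrived ones join without waiting for a batch to drain.
#  - selective batching: token-wise ops (the linear projection below) are
#    batched across requests by stacking their tokens into one matrix,
#    while attention runs per request on that request's own KV history.

rng = np.random.default_rng(0)
D = 4
W = rng.standard_normal((D, D))  # hypothetical projection weights

def attention(q, kv):
    # Per-request attention over the request's own (variable-length) history.
    scores = kv @ q
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv

def run_iteration(batch):
    # batch: list of per-request state dicts ("kv", "len", "target").
    # Selective batching: one batched matmul over all requests' new tokens.
    x = np.stack([req["kv"][-1] for req in batch])  # (B, D)
    h = x @ W                                       # batched linear op
    for req, q in zip(batch, h):                    # attention: per request
        out = attention(q, req["kv"])
        req["kv"] = np.vstack([req["kv"], out])     # extend KV history
        req["len"] += 1

def scheduler(pending, max_batch=2):
    running, done = [], []
    while pending or running:
        while pending and len(running) < max_batch:  # FCFS admission
            running.append(pending.pop(0))
        run_iteration(running)                       # exactly ONE iteration
        for req in list(running):                    # finished requests leave
            if req["len"] >= req["target"]:
                running.remove(req)
                done.append(req["id"])
    return done

pending = [{"id": i, "kv": rng.standard_normal((1, D)), "len": 0, "target": t}
           for i, t in enumerate([2, 5, 3])]
print(scheduler(pending))  # → [0, 1, 2]
```

Note how request 2 is admitted as soon as request 0 completes, in the middle of request 1's generation; a request-level scheduler could not do this.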

### Implementation

* 13k lines of C++, based on CUDA.
* Used gRPC for communication in the control plane.
* Used NCCL in the data plane.
* Implemented fused kernels for LayerNorm, Attention, and GeLU operators.

### Evaluation

* Models
  * GPT-3 models (13B, 101B, 175B, 341B)
  * No public model checkpoints were available, so weights were randomly initialized (performance does not depend on the parameter values)
  * Synthesized traces of client requests
* **Orca** outperforms NVIDIA FasterTransformer, achieving up to a 36.9× throughput improvement at the same level of latency.
