# Orca: A distributed serving system for transformer-based generative models

## Meta Info

Presented in [OSDI 2022](https://www.usenix.org/conference/osdi22/presentation/yu).

Authors: Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun (*Seoul National University, FriendliAI*)

## Understanding the paper

### TL;DRs

This paper presents **Orca**, a distributed serving system that applies *iteration-level scheduling* (scheduling execution one model iteration at a time, instead of one whole request at a time) and *selective batching* to Transformer-based generative models.

### Background

* ML inference serving: serving system + execution engine
  * Example: Triton (groups multiple client requests into a batch) + FasterTransformer (runs the inference procedure in a batched manner)

### Key observations

*Transformer-based generative models* generate the next token *in an autoregressive manner*, so the model must be executed *multiple times* (once per output token) to serve a single inference request.
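
To make the observation concrete, here is a minimal Python sketch of an autoregressive decode loop; `model_step`, `EOS`, and the toy arithmetic are placeholders of mine, not anything from the paper. Each forward pass yields exactly one token, so serving one request means running the model once per generated token, and different requests finish after different numbers of iterations.

```python
# Toy autoregressive decode loop. `model_step` is a hypothetical stand-in
# for one full Transformer forward pass; a real engine would return logits.
EOS = 0  # assumed end-of-sequence token id

def model_step(tokens: list[int]) -> int:
    """One forward pass over the sequence so far; returns the next token id."""
    return (sum(tokens) * 31 + len(tokens)) % 50  # dummy arithmetic, not a model

def generate(prompt: list[int], max_new_tokens: int) -> list[int]:
    tokens = list(prompt)
    for _ in range(max_new_tokens):  # one model execution per output token
        nxt = model_step(tokens)
        tokens.append(nxt)
        if nxt == EOS:               # requests finish after different numbers
            break                    # of iterations -- the key observation
    return tokens

print(generate([7, 3], max_new_tokens=5))
```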

### Design

* Key designs
  * Schedule the execution at the granularity of an *iteration* instead of a *request* (see the scheduling sketch after this list)
  * Selective batching
    * Split the batch and process each request individually for *the Attention operation*
      * Not batching the Attention operation has *only a small impact on efficiency*, since Attention involves no model parameters
    * Apply batching to *all other operations*
* Others
  * Use a simple first-come-first-served (FCFS) scheduling algorithm
  * Adopt intra-layer and inter-layer model parallelism
  * Reserve "max tokens" slots of GPU memory in advance for storing the Attention keys and values
  * Tune the maximum batch size to maximize throughput while satisfying one's latency budget
  * Separate the communication channels for control messages (plus tokens) and tensor data transfer
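
As referenced above, here is a hedged Python sketch of iteration-level scheduling with FCFS admission; the `Request` structure, `MAX_LEN` cap, and toy `model_step` are my own simplifications, not Orca's engine interface. The point is that the scheduler invokes the engine for a single iteration at a time, so finished requests return immediately and newly arrived requests join the running batch without waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

EOS = 0        # assumed end-of-sequence token id
MAX_BATCH = 4  # maximum batch size, tuned against the latency budget
MAX_LEN = 10   # hypothetical per-request length cap to keep the demo finite

@dataclass
class Request:
    rid: int
    tokens: list = field(default_factory=list)

def model_step(tokens):
    """Stand-in for one Transformer iteration over a single request."""
    return (sum(tokens) * 31 + len(tokens)) % 7  # dummy logic

def serve(queue: deque):
    running = []
    while queue or running:
        # FCFS admission: fill free batch slots from the head of the queue
        # at *every* iteration, not only when the batch is empty.
        while queue and len(running) < MAX_BATCH:
            running.append(queue.popleft())
        # Run exactly ONE iteration for the current batch: one token each.
        for req in running:
            req.tokens.append(model_step(req.tokens))
        # Retire finished requests immediately instead of padding them along
        # until the slowest request in the batch completes.
        still_running = []
        for req in running:
            if req.tokens[-1] == EOS or len(req.tokens) >= MAX_LEN:
                print(f"request {req.rid} done after {len(req.tokens)} tokens")
            else:
                still_running.append(req)
        running = still_running

serve(deque(Request(rid=i, tokens=[i + 1]) for i in range(6)))
```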

### Implementation

* 13k lines of C++, based on CUDA.
* Used gRPC for communication in the control plane.
* Used NCCL in the data plane.
* Implemented fused kernels for LayerNorm, Attention, and GeLU operators.
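
The fused LayerNorm/Attention/GeLU kernels above line up with the selective batching split from the design section. Below is an illustrative numpy sketch of that split; the shapes, the toy `attention`, and the flatten/split helpers are my assumptions, not Orca's kernels. Token-wise operations such as Linear/GeLU run once over all tokens of all requests flattened into a single matrix, while Attention runs per request, since each request carries its own sequence length and key/value state.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 8                                 # hidden size
seq_lens = [3, 5, 2]                  # three requests with different lengths
hidden = [rng.standard_normal((n, H)) for n in seq_lens]

# Batched part: token-wise ops see one flat [sum(seq_lens), H] tensor,
# so requests of different lengths batch together without padding.
x = np.concatenate(hidden, axis=0)    # shape (10, 8)
W = rng.standard_normal((H, H))
x = np.maximum(x @ W, 0.0)            # e.g. Linear + activation, batched

# Non-batched part: split back into per-request tensors and run a toy
# single-head self-attention on each request individually.
def attention(q: np.ndarray) -> np.ndarray:
    scores = q @ q.T / np.sqrt(H)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ q

splits = np.split(x, np.cumsum(seq_lens)[:-1], axis=0)
outputs = [attention(part) for part in splits]  # one Attention call per request
print([out.shape for out in outputs])           # [(3, 8), (5, 8), (2, 8)]
```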

### Evaluation

* Models
  * GPT-3 models (13B, 101B, 175B, 341B)
  * No actual model checkpoints: parameters are randomly initialized, since performance does not depend on the parameter values
  * Synthesized traces of client requests
* **Orca** outperforms NVIDIA FasterTransformer: a 36.9x throughput improvement at the same level of latency (on the GPT-3 175B model).

