Orca: A distributed serving system for transformer-based generative models
#distributed_serving_system #batch_serving #selective_batching #transformer-based_model #iteration-level_scheduling
Meta Info
Presented in OSDI 2022.
Authors: Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun (Seoul National University, FriendliAI)
Understanding the paper
TL;DRs
This paper presents Orca, a distributed serving system that applies iteration-level scheduling (scheduling execution at the granularity of a single model iteration rather than a whole request) and selective batching to Transformer-based generative models.
Background
ML inference serving: serving system + execution engine
Example: Triton (groups multiple client requests into a batch) + FasterTransformer (runs the inference procedure in a batched manner)
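As a rough contrast for what follows, a minimal sketch of request-level serving under this split (the function names are illustrative assumptions, not Triton's or FasterTransformer's actual APIs):

```python
# Conventional request-level serving: the serving layer forms a batch of whole
# requests, and the execution engine returns only after EVERY request in the
# batch has produced all of its tokens. Names are illustrative stand-ins.

def serving_layer(pending, engine_run, batch_size):
    while pending:
        batch = pending[:batch_size]
        del pending[:batch_size]
        # Early-finishing requests wait for the longest one in the batch, and
        # newly arrived requests wait for the whole current batch to drain.
        yield from engine_run(batch)
```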
Key observations
Transformer-based generative models generate output tokens one at a time in an autoregressive manner, so the model must be executed once per output token to process a single inference request.
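For intuition, a minimal sketch of this loop (`model_forward` is a hypothetical stand-in for one full forward pass of the model):

```python
# One model execution per generated token: a request that emits N tokens
# requires N runs of the model, and requests finish at unpredictable times.

def generate(model_forward, prompt_tokens, eos_token, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = model_forward(tokens)  # one full iteration of the model
        tokens.append(next_token)
        if next_token == eos_token:
            break
    return tokens
```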
Design
Key designs
Schedule the execution at the granularity of an iteration instead of a request (see the sketch after this list)
Selective batching
Split the batch and process each request individually for the Attention operation
Not batching the Attention operation has only a small impact on efficiency, since Attention involves no model parameters and therefore gains little from batching
Apply batching to all other operations (e.g., Linear, LayerNorm, GeLU)
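The two ideas compose as in the toy sketch below (NumPy; the single-layer "model", all names, and shapes are illustrative assumptions, not Orca's actual interface). The outer loop schedules one iteration at a time; inside an iteration, only Attention runs per request:

```python
# Toy sketch: iteration-level scheduling + selective batching in one loop.
import numpy as np
from collections import deque

HIDDEN, MAX_BATCH = 8, 4
rng = np.random.default_rng(0)
W_QKV = rng.standard_normal((HIDDEN, 3 * HIDDEN)) / np.sqrt(HIDDEN)
W_OUT = rng.standard_normal((HIDDEN, HIDDEN)) / np.sqrt(HIDDEN)

class Request:
    def __init__(self, prompt_len, max_new):
        self.x = rng.standard_normal((prompt_len, HIDDEN))  # current input tokens
        self.keys = np.empty((0, HIDDEN))                   # per-request KV cache
        self.values = np.empty((0, HIDDEN))
        self.generated, self.max_new = 0, max_new

def attention(q, k, v):
    # Runs over ONE request's own keys/values (its sequence so far).
    s = q @ k.T / np.sqrt(HIDDEN)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ v

def run_iteration(batch):
    # Selective batching: flatten every request's tokens into ONE
    # [total_tokens, hidden] matrix, so Linear/LayerNorm/GeLU-style ops run
    # as a single batched matmul despite irregular per-request lengths.
    lens = [r.x.shape[0] for r in batch]
    qkv = np.concatenate([r.x for r in batch]) @ W_QKV      # batched
    outs, off = [], 0
    for r, n in zip(batch, lens):                           # Attention: per request
        q, k, v = np.split(qkv[off:off + n], 3, axis=-1)
        off += n
        r.keys = np.concatenate([r.keys, k])                # grow the KV cache
        r.values = np.concatenate([r.values, v])
        outs.append(attention(q, r.keys, r.values))
    hidden = np.concatenate(outs) @ W_OUT                   # batched again
    ends = np.cumsum(lens) - 1
    for r, e in zip(batch, ends):
        r.x = hidden[e:e + 1]    # next iteration feeds only the new token
        r.generated += 1

# Iteration-level scheduling: pick a batch, run ONE iteration, re-select.
# Finished requests leave immediately; waiting ones join the next iteration.
queue = deque(Request(prompt_len=3 + i, max_new=2 + i) for i in range(6))
running = []
while queue or running:
    while queue and len(running) < MAX_BATCH:               # FCFS admission
        running.append(queue.popleft())
    run_iteration(running)
    running = [r for r in running if r.generated < r.max_new]
```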
Others
Simple first-come-first-served (FCFS) scheduling algorithm
Adopt intra-layer and inter-layer model parallelism
Reserve "max tokens" slots of GPU memory for storing the keys & values in advance
Tune the maximum batch size to maximize throughput while satisfying one’s latency budget
Separate the communication channels for control messages (plus tokens) and tensor data transfer
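A minimal sketch of the reservation idea (names like `KVSlotPool` and `max_tokens` are illustrative, not Orca's actual API):

```python
# Reserve the worst case (prompt + max generated tokens) of KV-cache slots at
# admission time, so an admitted request can never run out of memory mid-way.

class KVSlotPool:
    def __init__(self, total_slots: int):
        self.free_slots = total_slots

    def try_reserve(self, max_tokens: int) -> bool:
        # Admit the request only if its worst-case reservation fits.
        if max_tokens > self.free_slots:
            return False
        self.free_slots -= max_tokens
        return True

    def release(self, max_tokens: int) -> None:
        # Return all reserved slots when the request finishes.
        self.free_slots += max_tokens

pool = KVSlotPool(total_slots=4096)
assert pool.try_reserve(max_tokens=512)  # admitted
pool.release(max_tokens=512)             # finished
```

Reserving the worst case up front trades some memory utilization for the guarantee that an admitted request never stalls mid-generation for lack of KV-cache space.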
Implementation
13k lines of C++, based on CUDA.
Used gRPC for communication in the control plane.
Used NCCL in the data plane.
Implemented fused kernels for LayerNorm, Attention, and GeLU operators.
Evaluation
Models
GPT-3 models (13B, 101B, 175B, 341B)
No actual model checkpoints were used, since only performance (not output quality) is measured
Synthesized the trace of client requests (see the sketch below)
Orca outperforms NVIDIA FasterTransformer: 36.9x throughput improvement at the same level of latency.
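A minimal sketch of trace synthesis (Poisson arrivals and uniform length ranges are assumptions for illustration; the paper's exact distributions may differ):

```python
# Synthesize a load-testing trace: arrival times plus per-request lengths.
import random

def synthesize_trace(num_requests, arrival_rate, len_range=(32, 512)):
    random.seed(0)
    trace, t = [], 0.0
    for i in range(num_requests):
        t += random.expovariate(arrival_rate)         # exponential gaps = Poisson arrivals
        trace.append({
            "id": i,
            "arrival_time": t,                        # seconds since start
            "input_len": random.randint(*len_range),  # prompt tokens
            "max_output_len": random.randint(*len_range),
        })
    return trace

trace = synthesize_trace(num_requests=1000, arrival_rate=5.0)  # ~5 req/s
```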