Orca: A distributed serving system for transformer-based generative models

#distributed_serving_system #batch_serving #selective_batching #transformer-based_model #iteration-level_scheduling

Meta Info

Presented at OSDI 2022.

Authors: Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, Byung-Gon Chun (Seoul National University, FriendliAI)

Understanding the paper

TL;DRs

This paper presents Orca, a distributed serving system that applies iteration-level scheduling (scheduling the execution engine one model iteration at a time, rather than one request at a time) and selective batching to Transformer-based generative models.

Background

  • ML inference serving: serving system + execution engine

    • Example: Triton (groups multiple client requests into a batch) + FasterTransformer (conducts the inference procedure in a batched manner)
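
A minimal sketch of this conventional request-level pipeline, with a hypothetical `engine_run` callable standing in for the execution engine (not Triton's or FasterTransformer's actual APIs):

```python
from collections import deque

def serve_request_level(engine_run, request_queue: deque, max_batch_size: int):
    while request_queue:
        # Serving system: group up to max_batch_size pending requests.
        batch = [request_queue.popleft()
                 for _ in range(min(max_batch_size, len(request_queue)))]
        # Execution engine: run the whole batch to completion; results come
        # back only after EVERY request in the batch has finished generating.
        yield from engine_run(batch)
```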

Key observations

Transformer-based generative models produce output tokens autoregressively, one token per model run, so processing a single inference request requires executing the model many times. Under request-level scheduling, a batch occupies the execution engine until every request in it finishes, so early-finishing requests are held back and newly arrived requests must wait for the whole batch.
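
A minimal sketch of this decode loop, with a hypothetical `step` callable standing in for one full forward pass of the model:

```python
from typing import Callable, List

def generate(step: Callable[[List[int]], int], prompt: List[int],
             eos_token: int, max_new_tokens: int) -> List[int]:
    # Each loop iteration is one forward pass of the model and yields
    # exactly one new token, so a request spans many model executions.
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = step(tokens)
        tokens.append(next_token)
        if next_token == eos_token:  # requests finish at different lengths
            break
    return tokens
```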

Design

  • Key designs

    • Schedule execution at the granularity of an iteration instead of a request: each scheduling decision runs a single iteration of the model, so finished requests leave the batch immediately and newly arrived requests can join it (a scheduler sketch follows this list)

    • Selective batching

      • Split the batch and process each request individually for the Attention operation, since requests in different phases have different numbers of tokens and cannot share one batched Attention call

        • The decision not to batch the executions of the Attention operation has only a small impact on efficiency, because Attention involves no model parameters whose GPU memory reads batching could amortize

      • Apply batching to all other operations (e.g., Linear, LayerNorm, GeLU), which work token-wise and can run on the requests' tokens flattened into a single tensor (see the selective-batching sketch after this list)

  • Others

    • Use a simple first-come-first-served (FCFS) scheduling algorithm

    • Adopt intra-layer and inter-layer model parallelism

    • Reserve "max tokens" slots of GPU memory per request in advance for storing the Attention keys and values

    • Tune the maximum batch size to maximize throughput while satisfying the latency budget

    • Separate the communication channels for control messages (plus tokens) and tensor data transfer
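
A minimal sketch of the iteration-level scheduling loop described above, using hypothetical `Request` and `engine_step` types rather than Orca's actual C++ interfaces:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Request:
    prompt: list
    generated: list = field(default_factory=list)
    finished: bool = False

def serve_iteration_level(engine_step: Callable[[List[Request]], None],
                          queue: deque, max_batch_size: int):
    running: List[Request] = []
    while queue or running:
        # FCFS admission: newly arrived requests join the running batch as
        # soon as a slot frees up, without waiting for the batch to drain.
        while queue and len(running) < max_batch_size:
            running.append(queue.popleft())
        # Run exactly one iteration: each running request gains one token.
        engine_step(running)
        # Return finished requests immediately.
        for r in [req for req in running if req.finished]:
            running.remove(r)
            yield r
```

Because the scheduler re-selects the batch on every iteration, a request that emits its end-of-sequence token leaves immediately and its slot goes to the next queued request.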
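
And a minimal sketch of selective batching with toy NumPy shapes (illustrative only, not Orca's fused CUDA kernels): token-wise operations run on one flattened tensor, while Attention is computed per request on its own K/V state.

```python
import numpy as np

def attention(q: np.ndarray, kv_cache) -> np.ndarray:
    # Toy single-head attention over this request's own K/V cache.
    k, v = kv_cache                                    # [seq, d] each
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def selective_batching_layer(xs, kv_caches, W: np.ndarray):
    # Batched part: token-wise ops (a Linear projection here) see all
    # requests' tokens flattened into a single [sum(len_i), d] matrix.
    flat = np.concatenate(xs, axis=0) @ W
    # Unbatched part: split back per request, since each request has its
    # own sequence length and K/V cache, and run Attention individually.
    outs, offset = [], 0
    for x, cache in zip(xs, kv_caches):
        h = flat[offset:offset + len(x)]
        outs.append(attention(h, cache))
        offset += len(x)
    return outs
```

The paper realizes this split-then-merge pattern with Split and Merge operations inserted around the per-request Attention computations.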

Implementation

  • 13k lines of C++, based on CUDA.

  • Used gRPC for communication in the control plane.

  • Used NCCL in the data plane.

  • Implemented fused kernels for LayerNorm, Attention, and GeLU operators.

Evaluation

  • Models

    • GPT-3 models (13B, 101B, 175B, 341B)

    • No actual model checkpoints; the evaluation focuses on serving performance rather than output quality

    • Synthesized traces of client requests

  • Orca outperforms NVIDIA FasterTransformer: 36.9x throughput improvement at the same level of latency.
