High-throughput generative inference of large language models with a single GPU

An offloading framework for high-throughput LLM inference.

Meta Info

Presented in arXiv:2303.06865 (accepted at ICML 2023).

Authors: Ying Sheng (Stanford), Lianmin Zheng (UC Berkeley), Binhang Yuan (ETH), Zhuohan Li (UC Berkeley), Max Ryabinin (Yandex & HSE University), Daniel Y. Fu, Zhiqiang Xie (Stanford), Beidi Chen (Meta & CMU), Clark Barrett (Stanford), Joseph E. Gonzalez (UC Berkeley), Percy Liang, Christopher Ré (Stanford), Ion Stoica (UC Berkeley), Ce Zhang (ETH).

Code: https://github.com/FMInference/FlexGen

Understanding the paper

TL;DRs

This paper presents FlexGen, an offloading framework for high-throughput LLM inference. It aggregates memory from the GPU, CPU, and disk, and schedules I/O operations efficiently. It also optionally supports compression and distributed pipeline parallelism.
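
To make the idea of aggregating memory across tiers concrete, here is a minimal sketch that fills the fastest tier first and spills the remainder to slower ones. It is only an illustration: the function name `place_across_tiers`, the greedy rule, and the toy capacities are assumptions, and this is not FlexGen's actual placement policy (the paper searches for a policy with a linear-programming-based cost model).

```python
def place_across_tiers(total, capacities):
    """Greedily assign `total` units of data to tiers, fastest tier first.

    `capacities` is an ordered list of (tier_name, capacity) pairs.
    Hypothetical helper for illustration; not part of FlexGen.
    """
    placement = {}
    remaining = total
    for name, capacity in capacities:
        used = min(remaining, capacity)  # take as much as this tier can hold
        placement[name] = used
        remaining -= used
    if remaining > 0:
        raise MemoryError(f"{remaining} units do not fit in any tier")
    return placement


# Toy numbers: 10 units of data, tiers listed from fastest to slowest.
print(place_across_tiers(10, [("gpu", 4), ("cpu", 4), ("disk", 8)]))
```

A greedy fill ignores transfer costs entirely, which is exactly why a real system needs to reason about I/O scheduling as well as capacity.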

Evaluation

  • NVIDIA T4 GPU instances from Google Cloud.

    • GPU: NVIDIA T4, 16 GB

    • CPU: Intel Xeon @ 2.00GHz, 208 GB memory

    • Disk: Cloud default SSD (NVMe), 1.5 TB

  • Model

    • OPT models with 6.7B to 175B parameters.

  • Baseline

    • DeepSpeed-Inference
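
A quick back-of-the-envelope calculation shows why offloading matters for this setup. The sketch below assumes weights are stored in FP16 (2 bytes per parameter), uses the nominal parameter counts, treats the listed capacities as fully usable decimal gigabytes, and counts weights only (no KV cache or activations). The helper names are hypothetical.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored as 16-bit floats

# Capacities from the evaluation setup, in decimal GB.
GPU_GB, CPU_GB, DISK_GB = 16, 208, 1500


def weight_gb(num_params):
    """Approximate FP16 weight footprint in decimal GB."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9


def smallest_sufficient_tier(num_params):
    """Return the smallest combination of tiers that can hold the weights."""
    size = weight_gb(num_params)
    if size <= GPU_GB:
        return "gpu"
    if size <= GPU_GB + CPU_GB:
        return "gpu+cpu"
    if size <= GPU_GB + CPU_GB + DISK_GB:
        return "gpu+cpu+disk"
    return "does not fit"


for name, params in [("OPT-6.7B", 6.7e9), ("OPT-175B", 175e9)]:
    print(name, round(weight_gb(params), 1), "GB ->", smallest_sufficient_tier(params))
```

Under these assumptions the 175B model needs roughly 350 GB for weights alone, which exceeds the 224 GB of combined GPU and CPU memory, so part of the model has to live on disk.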

Implementation

  • Implemented on top of PyTorch.

  • Manages multiple CUDA streams and CPU threads to overlap I/O with compute.

  • Creates files for tensors stored on disk and maps them into virtual memory.
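
The last two points can be sketched together using only the Python standard library: tensors live in a file that is memory-mapped, and a background thread prefetches the next layer while the current one is being processed. This is a simplified stand-in, not FlexGen's code. A CPU thread plays the role that CUDA streams play on the GPU, the file layout and all function names are hypothetical, and `sum` stands in for the real computation. Because of Python's global interpreter lock, the example illustrates the scheduling pattern rather than a real speedup.

```python
import mmap
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor


def write_layers(path, layers):
    """Write each layer (a list of floats) back to back as float32 values."""
    with open(path, "wb") as f:
        for layer in layers:
            array("f", layer).tofile(f)


def load_layer(mm, index, layer_len):
    """Copy one layer out of the memory-mapped file into an in-memory array."""
    itemsize = array("f").itemsize
    start = index * layer_len * itemsize
    out = array("f")
    out.frombytes(mm[start:start + layer_len * itemsize])
    return out


def run(path, num_layers, layer_len):
    """Process layers in order while prefetching the next one in the background."""
    results = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
            ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, mm, 0, layer_len)
        for i in range(num_layers):
            weights = pending.result()  # wait for the current layer
            if i + 1 < num_layers:
                # Start loading the next layer before computing on this one.
                pending = pool.submit(load_layer, mm, i + 1, layer_len)
            results.append(sum(weights))  # placeholder for the real compute
    return results


layers = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weights.bin")
    write_layers(path, layers)
    print(run(path, len(layers), 2))
```

The key design point is that the load for layer i+1 is issued before the compute for layer i starts, so the two can proceed concurrently whenever the underlying I/O and compute engines allow it.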
