High-throughput generative inference of large language models with a single GPU

An offloading framework for high-throughput LLM inference.

Meta Info

Presented in arXiv:2303.06865 (accepted at ICML 2023).

Authors: Ying Sheng (Stanford), Lianmin Zheng (UC Berkeley), Binhang Yuan (ETH), Zhuohan Li (UC Berkeley), Max Ryabinin (Yandex & HSE University), Daniel Y. Fu, Zhiqiang Xie (Stanford), Beidi Chen (Meta & CMU), Clark Barrett (Stanford), Joseph E. Gonzalez (UC Berkeley), Percy Liang, Christopher Ré (Stanford), Ion Stoica (UC Berkeley), Ce Zhang (ETH).

Code: https://github.com/FMInference/FlexGen

Understanding the paper

TL;DRs

This paper presents FlexGen, an offloading framework for high-throughput LLM inference. It aggregates memory from the GPU, CPU, and disk and schedules I/O operations efficiently. It can also apply compression and distributed pipeline parallelism.

Evaluation

NVIDIA T4 GPU instances from Google Cloud.
- GPU: NVIDIA T4, 16 GB
- CPU: Intel Xeon @ 2.00GHz, 208 GB DRAM
- Disk: Cloud default SSD (NVMe), 1.5 TB
Model
- OPT models with 6.7B to 175B parameters.
Baseline
- DeepSpeed-Inference
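
To see why this setup calls for offloading, the weight footprint of the smallest and largest models can be compared with the capacity of each memory tier. The sketch below is a back-of-the-envelope calculation, not FlexGen's placement policy. It assumes 16-bit weights (2 bytes per parameter), decimal gigabytes, and counts weights only, ignoring the KV cache and activations. The function names are hypothetical.

```python
GB = 10**9
BYTES_PER_PARAM = 2  # assumption: 16-bit weights

# Capacities from the evaluation setup above, in decimal GB.
TIERS_GB = [("GPU", 16), ("CPU", 208), ("Disk", 1500)]


def weight_footprint_gb(num_params):
    """Size of the model weights alone, in GB."""
    return num_params * BYTES_PER_PARAM / GB


def tiers_needed(num_params):
    """List the tiers required to hold the weights, filling GPU, then CPU, then disk.

    This is only a capacity check; it does not model how FlexGen
    actually distributes tensors across tiers.
    """
    remaining = weight_footprint_gb(num_params)
    used = []
    for name, capacity in TIERS_GB:
        if remaining <= 0:
            break
        used.append(name)
        remaining -= capacity
    if remaining > 0:
        raise ValueError("weights do not fit in the available tiers")
    return used


if __name__ == "__main__":
    for label, params in [("OPT-6.7B", 6_700_000_000), ("OPT-175B", 175_000_000_000)]:
        print(label, weight_footprint_gb(params), "GB ->", tiers_needed(params))
```

Under these assumptions, the 6.7B model's weights (about 13.4 GB) fit in the T4's 16 GB, while the 175B model's weights (about 350 GB) exceed the combined 224 GB of GPU and CPU memory and therefore spill to disk.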

Implementation

Implemented on top of PyTorch.
Manages multiple CUDA streams and CPU threads to overlap I/O with compute.
Creates files for tensors stored on disk and maps them as virtual memory.
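
The two ideas above, memory-mapped tensor files and overlapping I/O with compute, can be illustrated with a small CPU-only sketch. It is not FlexGen's code: it uses a standard-library thread pool in place of CUDA streams, a made-up file layout (float32 layers stored back to back), and a toy "layer" that computes scale * x + bias. All function names are hypothetical.

```python
import mmap
import struct
from concurrent.futures import ThreadPoolExecutor

FLOAT_SIZE = 4  # bytes per float32 value


def create_disk_weights(path, layers):
    """Write each layer's weights back to back into one file (assumed layout)."""
    with open(path, "wb") as f:
        for weights in layers:
            f.write(struct.pack(f"<{len(weights)}f", *weights))


def load_layer(path, layer_idx, layer_len):
    """Map the file into virtual memory and copy out one layer's weights."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            offset = layer_idx * layer_len * FLOAT_SIZE
            return list(struct.unpack_from(f"<{layer_len}f", mm, offset))


def compute(weights, x):
    """Toy stand-in for a layer's forward pass: scale * x + bias."""
    scale, bias = weights
    return scale * x + bias


def run_layers(path, num_layers, layer_len, x):
    """Run layers in order while prefetching the next layer's weights."""
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer, path, 0, layer_len)
        for i in range(num_layers):
            weights = pending.result()  # wait for layer i's weights
            if i + 1 < num_layers:
                # Start loading layer i + 1 before computing layer i,
                # so the load runs in the background during compute.
                pending = io_pool.submit(load_layer, path, i + 1, layer_len)
            x = compute(weights, x)
    return x


if __name__ == "__main__":
    import os
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        create_disk_weights(path, [[2.0, 1.0], [0.5, -1.0], [4.0, 0.25]])
        print(run_layers(path, num_layers=3, layer_len=2, x=3.0))
```

The single I/O worker keeps at most one load in flight, which is the simplest form of double buffering. A GPU implementation would instead issue the copies on a separate CUDA stream and synchronize before the layer that needs them.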
