High-throughput generative inference of large language models with a single GPU
An offloading framework for high-throughput LLM inference.
Meta Info
Presented in arXiv:2303.06865 (accepted at ICML 2023).
Authors: Ying Sheng (Stanford), Lianmin Zheng (UC Berkeley), Binhang Yuan (ETH), Zhuohan Li (UC Berkeley), Max Ryabinin (Yandex & HSE University), Daniel Y. Fu, Zhiqiang Xie (Stanford), Beidi Chen (Meta & CMU), Clark Barrett (Stanford), Joseph E. Gonzalez (UC Berkeley), Percy Liang, Christopher Ré (Stanford), Ion Stoica (UC Berkeley), Ce Zhang (ETH).
Code: https://github.com/FMInference/FlexGen
Understanding the paper
TL;DRs
This paper presents FlexGen, an offloading framework for high-throughput LLM inference. It aggregates memory from the GPU, CPU, and disk and schedules I/O operations efficiently. It can also apply compression and distributed pipeline parallelism.
Evaluation
NVIDIA T4 GPU instances from Google Cloud.
GPU: NVIDIA T4, 16 GB
CPU: Intel Xeon @ 2.00GHz, 208 GB DRAM
Disk: Cloud default SSD (NVMe), 1.5 TB
Model
OPT models with 6.7B to 175B parameters.
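To see why offloading is needed at this scale, the following back-of-the-envelope sketch estimates the weight footprint of the smallest and largest models and spreads it over the three memory tiers listed under Evaluation. It assumes 2 bytes per parameter (FP16), treats 1 GB as 10^9 bytes, and ignores the KV cache and activations. The greedy fastest-tier-first placement and the names `TIERS`, `weight_footprint_gb`, and `place_weights` are illustrative assumptions, not FlexGen's actual placement policy.

```python
# Assumption: weights are stored in FP16, i.e. 2 bytes per parameter.
BYTES_PER_PARAM_FP16 = 2

# Tier capacities in GB (1 GB = 1e9 bytes), taken from the hardware list above.
# Assumption: the whole capacity of each tier is available for weights.
TIERS = [("gpu", 16), ("cpu", 208), ("disk", 1500)]


def weight_footprint_gb(num_params):
    """Return the size of the model weights in GB under the FP16 assumption."""
    return num_params * BYTES_PER_PARAM_FP16 / 1e9


def place_weights(num_params, tiers=TIERS):
    """Greedily fill the fastest tier first; return GB of weights per tier.

    This is a hypothetical illustration, not FlexGen's placement policy.
    """
    remaining = weight_footprint_gb(num_params)
    placement = {}
    for name, capacity_gb in tiers:
        used = min(remaining, capacity_gb)
        placement[name] = used
        remaining -= used
    if remaining > 1e-9:
        raise ValueError("weights do not fit in the available tiers")
    return placement


if __name__ == "__main__":
    for params in (6.7e9, 175e9):
        print(params, place_weights(params))
```

Under these assumptions the 6.7B model (about 13.4 GB) fits on the GPU alone, while the 175B model (350 GB) fills the 16 GB GPU and the 208 GB of CPU memory and still leaves 126 GB for the disk.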
Baseline
DeepSpeed ZeRO-Inference
Implementation
Implemented on top of PyTorch.
Manage multiple CUDA streams and CPU threads to overlap I/O with compute.
Create files for tensors stored on disk and map them into virtual memory.
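The two implementation points above (overlapping I/O with compute, and memory-mapped files for disk tensors) can be illustrated with a minimal, CPU-only sketch. It uses only the Python standard library: each layer's weights live in their own file, a background thread maps and reads the next layer while the current layer is being computed, and the "compute" is a toy element-wise multiply. All names here (`write_layer`, `load_layer`, `compute`, `run_layers`) are hypothetical. The sketch does not use CUDA streams or PyTorch and is not FlexGen's code; it only shows the scheduling pattern.

```python
import mmap
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor


def write_layer(path, values):
    """Store one layer's float32 weights in its own file on disk."""
    with open(path, "wb") as f:
        array("f", values).tofile(f)


def load_layer(path):
    """Map the file into virtual memory and copy the weights out as floats."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        weights = array("f")
        weights.frombytes(mm[:])  # copy the mapped bytes into a float32 array
        return weights.tolist()


def compute(weights, x):
    """Toy stand-in for a layer: element-wise multiply."""
    return [w * v for w, v in zip(weights, x)]


def run_layers(paths, x):
    """Apply every layer to x, prefetching layer i+1 while computing layer i."""
    if not paths:
        return x
    with ThreadPoolExecutor(max_workers=1) as io_pool:
        pending = io_pool.submit(load_layer, paths[0])
        for i in range(len(paths)):
            weights = pending.result()  # wait until layer i's weights are loaded
            if i + 1 < len(paths):
                # Start loading the next layer before computing the current one.
                pending = io_pool.submit(load_layer, paths[i + 1])
            x = compute(weights, x)  # runs while the prefetch is in flight
    return x


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        layer_paths = []
        for i, vals in enumerate([[1.0, 2.0], [3.0, 4.0], [0.5, 0.5]]):
            p = os.path.join(tmp, f"layer{i}.bin")
            write_layer(p, vals)
            layer_paths.append(p)
        print(run_layers(layer_paths, [1.0, 1.0]))
```

A single I/O worker keeps at most one load in flight, which bounds the extra memory to one prefetched layer. In a GPU setting the same pattern would presumably put the transfer and the compute on separate CUDA streams, as the notes above describe.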