# High-throughput generative inference of large language models with a single GPU

## Meta Info

Presented in [arXiv:2303.06865](https://arxiv.org/abs/2303.06865). (Accepted at ICML 2023.)

Authors: Ying Sheng (*Stanford*), Lianmin Zheng (*UC Berkeley*), Binhang Yuan (*ETH*), Zhuohan Li (*UC Berkeley*), Max Ryabinin (*Yandex & HSE University*), Daniel Y. Fu, Zhiqiang Xie (*Stanford*), Beidi Chen (*Meta & CMU*), Clark Barrett (*Stanford*), Joseph E. Gonzalez (*UC Berkeley*), Percy Liang, Christopher Ré (*Stanford*), Ion Stoica (*UC Berkeley*), Ce Zhang (*ETH*).

Code: <https://github.com/FMInference/FlexGen>

## Understanding the paper

### TL;DRs

This paper presents **FlexGen**, *an offloading framework* for *high-throughput LLM inference*.\
It *aggregates memory* from the *GPU, CPU, and disk*, and efficiently *schedules I/O operations*, along with *possible compression methods* and *distributed pipeline parallelism*.

### Evaluation

* NVIDIA T4 GPU instances from Google Cloud.
  * GPU: NVIDIA T4, 16 GB memory
  * CPU: Intel Xeon @ 2.00GHz, 208 GB DRAM
  * Disk: Cloud default SSD (NVMe), 1.5 TB
* Model
  * [OPT](https://ai.facebook.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/) models with 6.7B to 175B parameters.
* Baseline
  * DeepSpeed-Inference
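
To see why this hardware setup forces offloading, a rough back-of-the-envelope calculation helps. The sketch below assumes FP16 weights (2 bytes per parameter) and counts weights only, ignoring the KV cache and activations; the helper names `weight_gb` and `smallest_tier` are hypothetical and not part of FlexGen.

```python
# Rough estimate only: assumes FP16 weights (2 bytes per parameter) and
# ignores the KV cache and activations, which need additional memory.
GPU_GB, CPU_GB, DISK_GB = 16, 208, 1500  # capacities from the evaluation setup


def weight_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def smallest_tier(need_gb: float) -> str:
    """Smallest aggregated memory tier whose capacity covers `need_gb`."""
    if need_gb <= GPU_GB:
        return "GPU"
    if need_gb <= GPU_GB + CPU_GB:
        return "GPU + CPU"
    if need_gb <= GPU_GB + CPU_GB + DISK_GB:
        return "GPU + CPU + disk"
    return "does not fit"


if __name__ == "__main__":
    for name, params in [("OPT-6.7B", 6.7e9), ("OPT-175B", 175e9)]:
        need = weight_gb(params)
        print(f"{name}: {need:.1f} GB of weights -> {smallest_tier(need)}")
```

Under these assumptions, the weights of the largest model alone exceed the combined GPU and CPU memory, so the disk has to be part of the memory hierarchy.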

### Implementation

* Implemented on top of **PyTorch**.
* Manages *multiple CUDA streams* and *CPU threads* to overlap I/O with compute.
* Creates files for tensors stored on disk and maps them into virtual memory.
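
The last two points can be illustrated with a small CPU-only analogy. This is a minimal sketch using only the Python standard library, not FlexGen's actual code: it stores each "layer" in its own file, memory-maps the file to read it back, and uses one worker thread to prefetch layer `i + 1` while layer `i` is being computed. All function names (`write_layer`, `load_layer`, `compute`, `run_layers`, `demo`) are hypothetical, and the dot product merely stands in for a real forward pass on the GPU.

```python
import mmap
import os
import tempfile
from array import array
from concurrent.futures import ThreadPoolExecutor


def write_layer(path, values):
    """Store one layer's weights on disk as raw float32 values."""
    with open(path, "wb") as f:
        array("f", values).tofile(f)


def load_layer(path):
    """Map the file into virtual memory and copy the weights out of it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            weights = array("f")
            weights.frombytes(mapped[:])
            return weights


def compute(weights, x):
    """Stand-in for a layer's forward pass: a plain dot product."""
    return sum(w * xi for w, xi in zip(weights, x))


def run_layers(paths, x):
    """Compute layer i while a worker thread loads layer i + 1."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, paths[0])
        for i in range(len(paths)):
            weights = pending.result()  # wait for the current layer's weights
            if i + 1 < len(paths):
                # Start loading the next layer before computing this one.
                pending = pool.submit(load_layer, paths[i + 1])
            outputs.append(compute(weights, x))
    return outputs


def demo():
    """Write three tiny layers to a temporary directory and run them."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for i in range(3):
            path = os.path.join(tmp, f"layer{i}.bin")
            write_layer(path, [float(i + 1)] * 4)
            paths.append(path)
        return run_layers(paths, [1.0, 2.0, 3.0, 4.0])


if __name__ == "__main__":
    print(demo())
```

In pure Python the achievable overlap is limited, since only the file I/O runs outside the interpreter lock; the point of the sketch is the prefetch-then-compute ordering, which is the same pattern that separate CUDA streams make possible for GPU transfers and kernels.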
