This paper presents FlexGen, an offloading framework for high-throughput LLM inference.
It aggregates memory from the GPU, CPU, and disk, and efficiently schedules I/O operations, along with possible compression methods and distributed pipeline parallelism.