Large Language Model (LLM)
I am actively maintaining this list.
LLM Training
LLM Inference
Efficiently Programming Large Language Models using SGLang (arXiv 2312.07104) [Personal Notes] [arXiv] [Code]
UC Berkeley & Stanford
Co-design the front-end programming interface and back-end serving runtime
SGLang; SGVM w/ RadixAttention
Reuse KV cache across multiple calls and programs
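A minimal sketch of the prefix-sharing idea behind RadixAttention: cached KV tensors are indexed by token prefix, so a new call that shares a prefix with an earlier one only needs to prefill the suffix. Class and method names here are illustrative; the actual runtime uses a radix tree with LRU eviction over GPU memory.

```python
# Sketch of prefix-sharing KV cache reuse (the RadixAttention idea).

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.kv = None       # handle to cached KV tensors for the prefix ending here

class PrefixKVCache:
    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return (length, kv) of the longest cached prefix of `tokens`."""
        node, best_len, best_kv = self.root, 0, None
        for i, tok in enumerate(tokens):
            if tok not in node.children:
                break
            node = node.children[tok]
            if node.kv is not None:
                best_len, best_kv = i + 1, node.kv
        return best_len, best_kv

    def insert(self, tokens, kv):
        """Cache KV for the full `tokens` prefix so later calls can reuse it."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.kv = kv

cache = PrefixKVCache()
cache.insert([1, 2, 3, 4], kv="kv(1..4)")
print(cache.match_prefix([1, 2, 3, 4, 5, 6]))  # -> (4, 'kv(1..4)'): only tokens 5, 6 need prefill
```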
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv 2312.12456) [arXiv]
SJTU
A GPU-CPU hybrid inference engine
Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU
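A toy sketch of the hot/cold split for a single FFN layer, assuming activation frequencies were profiled offline as in the paper; numpy arrays stand in for GPU- and CPU-resident weights, and the threshold is illustrative.

```python
# Hot (frequently activated) neurons live on the GPU; cold ones are computed on the CPU.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_ffn = 16, 64
W = rng.standard_normal((d_ffn, d_in))          # one FFN weight matrix
freq = rng.random(d_ffn)                        # offline-profiled activation frequency

hot = freq > 0.7                                # preload these neurons "on the GPU"
W_gpu, W_cpu = W[hot], W[~hot]

def ffn_forward(x):
    out = np.empty(d_ffn)
    out[hot] = W_gpu @ x                        # fast path: GPU-resident hot neurons
    out[~hot] = W_cpu @ x                       # slow path: CPU computes cold neurons
    return np.maximum(out, 0.0)                 # ReLU keeps most cold outputs at zero

print(ffn_forward(rng.standard_normal(d_in)).shape)  # (64,)
```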
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv 2312.11514) [arXiv]
Apple
SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]
CMU & PKU & CUHK
Distributed LLM serving system on preemptible/spot instances
Techniques
Dynamically adapt the LLM parallelization configuration
Minimize the cost of migrating instances for dynamic reparallelization
Stateful inference recovery
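A toy sketch of the reparallelization decision, assuming a fixed GPU pool after a preemption. The throughput and migration-cost functions below are illustrative stand-ins for the paper's cost models.

```python
# When spot instances are preempted or added, pick a new (data, tensor, pipeline)
# parallel configuration for the GPUs that remain.
from itertools import product

def candidate_configs(num_gpus):
    """All (dp, tp, pp) factorizations that use exactly num_gpus GPUs."""
    for dp, tp, pp in product(range(1, num_gpus + 1), repeat=3):
        if dp * tp * pp == num_gpus:
            yield dp, tp, pp

def throughput(dp, tp, pp):
    # Toy model: dp scales throughput; tp/pp trade compute against communication.
    return dp * 1.0 / (1 + 0.1 * tp + 0.2 * pp)

def migration_cost(old, new):
    # Toy model: cost grows with how much the partitioning changes.
    return sum(abs(a - b) for a, b in zip(old, new))

def reparallelize(old_config, gpus_left):
    # Balance serving throughput against the cost of moving parameters/KV cache;
    # a simple weighted score stands in for the paper's optimizer.
    return max(candidate_configs(gpus_left),
               key=lambda c: throughput(*c) - 0.05 * migration_cost(old_config, c))

print(reparallelize(old_config=(2, 2, 2), gpus_left=6))  # e.g. after two preemptions
```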
HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (arXiv 2311.11514) [Personal Notes] [arXiv] [Code]
HKUST & ETH & CMU
Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline parallel stage can be assigned with a different number of layers and tensor model parallel degree)
Propose a heuristic-based evolutionary algorithm to search for the optimal layout
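A toy sketch of the search space, assuming four heterogeneous pipeline stages: each stage gets its own layer count and TP degree. A random-mutation loop stands in for the paper's heuristic evolutionary algorithm, and the cost model is illustrative.

```python
import random

TOTAL_LAYERS, GPU_SPEEDS = 24, [1.0, 1.0, 0.5, 0.5]  # heterogeneous stage speeds

def stage_time(layers, tp, speed):
    # Toy cost: compute shrinks with tp, communication grows with it.
    return layers / (tp * speed) + 0.3 * tp

def pipeline_latency(layout):
    # Pipeline throughput is limited by its slowest stage.
    return max(stage_time(l, tp, s) for (l, tp), s in zip(layout, GPU_SPEEDS))

def mutate(layout):
    layout = [list(st) for st in layout]
    a, b = random.sample(range(len(layout)), 2)
    if layout[a][0] > 1:                      # move one layer between stages
        layout[a][0] -= 1; layout[b][0] += 1
    i = random.randrange(len(layout))         # or tweak a stage's TP degree
    layout[i][1] = random.choice([1, 2, 4])
    return [tuple(st) for st in layout]

random.seed(0)
best = [(6, 2), (6, 2), (6, 2), (6, 2)]       # start from a symmetric layout
for _ in range(2000):
    cand = mutate(best)
    if pipeline_latency(cand) < pipeline_latency(best):
        best = cand
print(best, round(pipeline_latency(best), 2))  # slower stages end up with fewer layers
```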
S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv 2311.03285) [arXiv] [Code]
UC Berkeley
A system to serve many LoRA adapters
Store all adapters in main memory and fetch the adapters used by currently running queries into GPU memory
Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths
Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation
Built on top of LightLLM
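A minimal sketch of the unified pool idea: pages are handed out to either KV-cache blocks or adapter weights from the same free list, so neither kind fragments its own dedicated region. Class and method names are illustrative.

```python
class UnifiedPagePool:
    def __init__(self, num_pages):
        self.free = list(range(num_pages))
        self.owner = {}                       # page id -> ("kv" | "adapter", tag)

    def alloc(self, kind, tag, n_pages):
        if len(self.free) < n_pages:
            raise MemoryError("pool exhausted; evict an idle adapter or KV block")
        pages = [self.free.pop() for _ in range(n_pages)]
        for p in pages:
            self.owner[p] = (kind, tag)
        return pages

    def free_pages(self, pages):
        for p in pages:
            del self.owner[p]
            self.free.append(p)

pool = UnifiedPagePool(num_pages=8)
kv = pool.alloc("kv", tag="req42", n_pages=3)          # KV cache for one sequence
ad = pool.alloc("adapter", tag="lora-r16", n_pages=2)  # a rank-16 adapter's weights
pool.free_pages(kv)                                    # a finished request frees its pages
print(len(pool.free))  # 6 pages available again, usable by either kind
```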
Punica: Multi-Tenant LoRA Serving (arXiv 2310.18547) [arXiv] [Code]
UW & Duke
A system to serve multiple LoRA models in a shared GPU cluster
A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)
Batch GPU operations for concurrent execution of different LoRA models
A GPU only needs to store a single copy of the pre-trained model
A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads
Route new requests to a small set of active GPUs
Allocate additional GPU resources when the existing GPUs are fully utilized
Periodically migrate existing requests for consolidation
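A numpy sketch of SGMV's semantics (not the CUDA kernel): each request in the batch gathers its own adapter's low-rank matrices, while the base-model GEMM is shared across the batch. All shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters, batch = 32, 4, 3, 5
W_base = rng.standard_normal((d, d))          # single shared copy of the base weights
A = rng.standard_normal((n_adapters, r, d))   # LoRA down-projections
B = rng.standard_normal((n_adapters, d, r))   # LoRA up-projections

X = rng.standard_normal((batch, d))
adapter_idx = np.array([0, 2, 2, 1, 0])       # which adapter each request uses

def sgmv(X, A, B, idx):
    # Gather each request's adapter and apply y_i = B[idx_i] @ (A[idx_i] @ x_i).
    # The real kernel batches these per-segment matvecs in a single launch.
    return np.einsum("bdr,brk,bk->bd", B[idx], A[idx], X)

Y = X @ W_base.T + sgmv(X, A, B, adapter_idx)
print(Y.shape)  # (5, 32): one fused pass over a multi-tenant batch
```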
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]
Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
High-throughput generative inference using only a single GPU.
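The throughput comes from offloading weights, activations, and KV cache to CPU RAM and disk while overlapping I/O with compute. A toy sketch of the prefetch-while-compute pattern, with plain copies standing in for asynchronous transfers:

```python
import numpy as np

rng = np.random.default_rng(0)
layers_on_cpu = [rng.standard_normal((64, 64)) for _ in range(8)]  # "offloaded" weights

def to_gpu(w):
    # Stand-in for an asynchronous host-to-device copy.
    return w.copy()

x = rng.standard_normal(64)
current = to_gpu(layers_on_cpu[0])
for i in range(len(layers_on_cpu)):
    nxt = to_gpu(layers_on_cpu[i + 1]) if i + 1 < len(layers_on_cpu) else None
    x = np.maximum(current @ x, 0.0)   # compute layer i while "prefetching" layer i+1
    current = nxt
print(x.shape)  # (64,)
```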
Fast Distributed Inference Serving for Large Language Models (arXiv 2305.05920) [Paper]
PKU
Skip-join multi-level feedback queue scheduling instead of first-come-first-served.
Proactive KV cache swapping.
Compared against Orca.
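A toy sketch of skip-join MLFQ: a new request "skips" to the queue level matching its expected first-iteration time (driven by prompt length) instead of always entering the top queue, and is demoted when it exhausts a level's quantum. The quanta and job fields below are illustrative.

```python
from collections import deque

QUANTA = [1, 2, 4, 8]                 # per-level time slices (arbitrary units)
queues = [deque() for _ in QUANTA]

def skip_join(job):
    # Enter at the first level whose quantum covers the job's first iteration.
    for lvl, q in enumerate(QUANTA):
        if job["first_iter_time"] <= q:
            queues[lvl].append(job)
            return
    queues[-1].append(job)

def schedule_step():
    for lvl, q in enumerate(queues):
        if q:
            job = q.popleft()
            job["remaining"] -= QUANTA[lvl]      # run one quantum at this level
            if job["remaining"] > 0:             # not done yet: demote one level
                queues[min(lvl + 1, len(queues) - 1)].append(job)
            return job["name"], lvl

skip_join({"name": "short", "first_iter_time": 1, "remaining": 2})
skip_join({"name": "long-prompt", "first_iter_time": 6, "remaining": 20})
print(schedule_step())  # ('short', 0): the short job runs ahead of the long one
```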
Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]
Google
Outstanding Paper Award
Model partitioning; PaLM; TPUv4
Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]
Seoul National University & FriendliAI
Iteration-level scheduling; selective batching.
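A minimal sketch of iteration-level scheduling: the running batch is re-formed after every decoding iteration, so completed requests exit and queued ones are admitted immediately instead of waiting for the whole batch to drain. Batch size and token counts are illustrative.

```python
from collections import deque

pending = deque([{"id": i, "tokens_left": n} for i, n in enumerate([2, 5, 1, 4])])
running, MAX_BATCH = [], 2

step = 0
while pending or running:
    while pending and len(running) < MAX_BATCH:   # admit new requests every iteration
        running.append(pending.popleft())
    for req in running:                           # one decoding iteration for the batch
        req["tokens_left"] -= 1
    done = [r["id"] for r in running if r["tokens_left"] == 0]
    running = [r for r in running if r["tokens_left"] > 0]
    step += 1
    if done:
        print(f"iter {step}: finished {done}")    # slots free up mid-stream
```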
Speculative Decoding
Online Speculative Decoding (arXiv 2310.07177) [Paper]
UC Berkeley & UCSD & Sisu Data & SJTU
Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]
UC Berkeley & ICSI & LBNL
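Both papers build on the same draft-then-verify loop: a small model cheaply proposes a few tokens, the large model verifies them, and tokens are kept up to the first disagreement. A toy greedy-acceptance sketch, where both "models" are stand-in functions:

```python
import random

random.seed(0)
def draft_model(ctx):  return (ctx[-1] * 3 + 1) % 10                          # toy small model
def target_model(ctx): return (ctx[-1] * 3 + 1 + (random.random() < 0.2)) % 10  # occasionally disagrees

def speculative_step(ctx, k=4):
    proposal = []
    for _ in range(k):                       # the small model drafts k tokens cheaply
        proposal.append(draft_model(ctx + proposal))
    accepted = []
    for tok in proposal:                     # one (conceptual) big-model pass verifies
        expect = target_model(ctx + accepted)
        if tok != expect:
            accepted.append(expect)          # replace the first mismatch, then stop
            break
        accepted.append(tok)
    return accepted                          # >= 1 target-quality token per big-model pass

ctx = [3]
for _ in range(3):
    ctx += speculative_step(ctx)
print(ctx)
```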
Acronyms
LLM: Large Language Model