Large Language Model (LLM)
I am actively maintaining this list.
LLM Training
Hybrid parallelism
Fault tolerance
LLM Inference
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]
UChicago & Microsoft & Stanford
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)
Efficiently Programming Large Language Models using SGLang (arXiv 2312.07104) [Personal Notes] [arXiv] [Code]
UC Berkeley & Stanford
Co-design the front-end programming interface and back-end serving runtime
SGLang; SGVM w/ RadixAttention
Reuse KV cache across multiple calls and programs
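A minimal sketch of the KV-cache prefix reuse behind RadixAttention, assuming a plain in-memory trie keyed on token IDs; the names (`KVTrie`, `kv`) are illustrative, and the real runtime manages paged GPU KV blocks with reference counting and LRU eviction.

```python
# Minimal sketch of KV-cache prefix reuse in the spirit of RadixAttention.
# A trie keyed on token IDs maps shared prompt prefixes to cached KV entries,
# so a new request only has to compute the KV for its unmatched suffix.

class TrieNode:
    def __init__(self):
        self.children = {}   # token_id -> TrieNode
        self.kv = None       # cached KV handle for the prefix ending here

class KVTrie:
    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return (num_matched_tokens, kv_handles_along_the_path)."""
        node, matched, handles = self.root, 0, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            handles.append(node.kv)
            matched += 1
        return matched, handles

    def insert(self, tokens, kv_handles):
        """Store per-token KV handles for a fully processed prompt."""
        node = self.root
        for t, kv in zip(tokens, kv_handles):
            node = node.children.setdefault(t, TrieNode())
            node.kv = kv

cache = KVTrie()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
matched, reused = cache.match_prefix([1, 2, 3, 9])
print(matched, reused)  # 3 tokens reused; only token 9 needs fresh prefill
```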
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv 2312.12456) [arXiv]
SJTU
A GPU-CPU hybrid inference engine
Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU
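A toy sketch of the hot/cold split, assuming neuron activation counts are available from offline profiling; real PowerInfer uses online activation predictors and adaptive neuron placement, while here the "GPU" and "CPU" paths are just NumPy index sets.

```python
# Toy sketch of a GPU-CPU hybrid FFN in the spirit of PowerInfer.
# Frequently activated ("hot") neurons are kept on the GPU; the rest
# ("cold") are computed on the CPU. Both paths run on NumPy here --
# the placement logic is what matters.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W_in = rng.standard_normal((d_model, d_ff))
W_out = rng.standard_normal((d_ff, d_model))

# Profiling statistics (assumed given): how often each neuron fired.
activation_counts = rng.integers(0, 1000, size=d_ff)
hot = np.argsort(activation_counts)[-d_ff // 4:]   # top 25% -> "GPU"
cold = np.setdiff1d(np.arange(d_ff), hot)          # rest    -> "CPU"

def ffn_hybrid(x):
    # "GPU" path: hot neurons, assumed resident in device memory.
    h_hot = np.maximum(x @ W_in[:, hot], 0.0)
    y = h_hot @ W_out[hot, :]
    # "CPU" path: cold neurons, computed on the host and added in.
    h_cold = np.maximum(x @ W_in[:, cold], 0.0)
    y += h_cold @ W_out[cold, :]
    return y

x = rng.standard_normal(d_model)
print(ffn_hybrid(x).shape)  # (16,)
```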
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv 2312.11514) [arXiv]
Apple
SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]
CMU & PKU & CUHK
Fast Distributed Inference Serving for Large Language Models (arXiv 2305.05920) [Paper]
PKU
Skip-join multi-level feedback queue scheduling instead of first-come-first-served.
Proactive KV cache swapping.
Compared to Orca
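A rough sketch of skip-join MLFQ scheduling, assuming the scheduler can estimate each request's prefill (first-iteration) time from its prompt length; the quanta and the `remaining` work values are made up for illustration.

```python
# Sketch of skip-join multi-level feedback queue scheduling: a request
# "skips" to the queue whose quantum covers its prefill time, and is
# demoted when it exhausts the quantum of its current level.
from collections import deque

QUANTA = [1, 2, 4, 8]                     # time quantum per priority level
queues = [deque() for _ in QUANTA]

def submit(req_id, prefill_time):
    # Skip-join: enter the first level whose quantum covers the prefill time.
    level = next((i for i, q in enumerate(QUANTA) if prefill_time <= q),
                 len(QUANTA) - 1)
    queues[level].append({"id": req_id, "remaining": prefill_time + 5})

def schedule_step():
    for level, q in enumerate(queues):
        if q:
            req = q.popleft()
            served = min(QUANTA[level], req["remaining"])
            req["remaining"] -= served
            if req["remaining"] > 0:            # quantum used up: demote
                queues[min(level + 1, len(queues) - 1)].append(req)
            return req["id"], served
    return None

submit("A", prefill_time=1)
submit("B", prefill_time=6)
for _ in range(5):
    print(schedule_step())
```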
Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]
Google
Outstanding Paper Award
Model partitioning; PaLM; TPUv4
Request Scheduling
Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]
Seoul National University & FriendliAI
Iteration-level scheduling; selective batching.
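A minimal sketch of iteration-level (continuous) batching, with `model_step` standing in for one decoding iteration; selective batching of attention vs. non-attention ops is not shown.

```python
# Sketch of iteration-level scheduling in the spirit of Orca: the batch is
# re-formed at every decoding iteration, so finished requests leave
# immediately and newly arrived requests join without waiting for the
# whole batch to drain.
from collections import deque
import random

random.seed(0)
waiting = deque(f"req{i}" for i in range(6))
running = {}                      # req_id -> tokens still to generate
MAX_BATCH = 3

def model_step(batch):
    # Placeholder for one forward pass over the selected batch.
    for req_id in batch:
        running[req_id] -= 1

while waiting or running:
    # Admit new requests up to the batch size -- at iteration granularity.
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(1, 4)
    model_step(list(running))
    # Retire requests that produced their last token this iteration.
    for req_id in [r for r, left in running.items() if left == 0]:
        print("finished", req_id)
        del running[req_id]
```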
KV Cache Management
Phase Disaggregation
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv 2401.11181) [arXiv]
ICT, CAS & Huawei Cloud
LoRA Serving
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv 2401.11240) [arXiv]
HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]
UC Berkeley
Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]
UW & Duke
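A toy NumPy sketch of the computation pattern the LoRA-serving systems above batch efficiently: one shared base-model GEMM plus a per-request low-rank correction, with each request picking its own adapter. Punica's BGMV kernel and S-LoRA's unified paging perform this gather on the GPU; the shapes and adapter store here are illustrative.

```python
# Toy sketch of multi-tenant LoRA serving: one batched base-model GEMM plus a
# per-request low-rank correction, where each request selects its own adapter.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 32, 4
W = rng.standard_normal((d_in, d_out))                    # shared base weight

# Adapter store: adapter_id -> (A, B), with effective weight W + A @ B
adapters = {aid: (rng.standard_normal((d_in, rank)),
                  rng.standard_normal((rank, d_out))) for aid in range(3)}

def lora_batch_forward(x, adapter_ids):
    """x: (batch, d_in); adapter_ids: one adapter per request in the batch."""
    y = x @ W                                             # one shared GEMM
    for i, aid in enumerate(adapter_ids):                 # per-request LoRA
        A, B = adapters[aid]
        y[i] += (x[i] @ A) @ B
    return y

x = rng.standard_normal((4, d_in))
print(lora_batch_forward(x, adapter_ids=[0, 2, 2, 1]).shape)  # (4, 32)
```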
Speculative Decoding
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]
CMU
Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]
UC Berkeley & ICSI & LBNL
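A minimal sketch of the greedy speculative-decoding loop underlying the papers above: a cheap draft model proposes a few tokens and the target model verifies them in one pass. `draft_next` and `target_next` are toy stand-ins; SpecInfer's token tree verification generalizes this to multiple candidate sequences.

```python
# Minimal sketch of speculative decoding (greedy variant): the draft model
# proposes k tokens, the target model checks them, and the longest agreeing
# prefix is accepted plus one corrected token at the first mismatch.
def draft_next(ctx):            # cheap draft model (toy: sum mod 7)
    return sum(ctx) % 7

def target_next(ctx):           # expensive target model (toy: sum mod 5)
    return sum(ctx) % 5

def speculative_step(ctx, k=4):
    # 1) Draft proposes k tokens autoregressively.
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_next(tmp)
        proposal.append(t)
        tmp.append(t)
    # 2) Target verifies all k positions (one batched pass in a real system).
    accepted = []
    for t in proposal:
        expected = target_next(ctx + accepted)
        if t == expected:
            accepted.append(t)          # draft and target agree: keep it
        else:
            accepted.append(expected)   # first mismatch: take target's token
            break
    return ctx + accepted

print(speculative_step([1, 2, 3]))
```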
Offloading
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]
Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
High-throughput generative inference with only a single GPU.
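A toy sketch of the weight-streaming side of offloaded inference, assuming all weights sit in CPU memory and each layer is copied to the GPU just before it runs; real FlexGen also offloads activations and KV cache, uses disk as a third tier, and overlaps I/O with compute via a zig-zag block schedule.

```python
# Toy sketch of offloaded, layer-by-layer inference in the spirit of FlexGen:
# weights live in CPU memory and each layer is brought onto the single GPU
# only while it is being computed.
import numpy as np

rng = np.random.default_rng(0)
n_layers, d = 8, 64
cpu_weights = [rng.standard_normal((d, d)) for _ in range(n_layers)]  # offloaded

def to_gpu(w):
    # Stand-in for a host-to-device copy (e.g. tensor.cuda()).
    return w.copy()

def forward(x):
    for layer_id in range(n_layers):
        w_gpu = to_gpu(cpu_weights[layer_id])   # stream this layer's weights in
        x = np.maximum(x @ w_gpu, 0.0)          # compute the layer on the "GPU"
        del w_gpu                               # free GPU memory for the next layer
    return x

print(forward(rng.standard_normal(d)).shape)   # (64,)
```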
Heterogeneous Environment
HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]
HKUST & ETH & CMU
Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline stage can be assigned a different number of layers and a different tensor model parallel degree)
Propose a heuristic-based evolutionary algorithm to search for the optimal layout
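A toy sketch of layout search in this spirit, assuming a made-up cost model and a simple mutation-based hill climber as a stand-in for HexGen's evolutionary algorithm; the device speeds, TP choices, and cost formula are all illustrative.

```python
# Toy search for an asymmetric parallel layout: each pipeline stage gets its
# own layer count and tensor-parallel degree, and mutation-based search
# minimizes a made-up cost model over heterogeneous devices.
import random

random.seed(0)
N_LAYERS, STAGES = 32, 4
GPU_SPEED = [1.0, 1.0, 0.5, 0.5]           # assumed heterogeneous device speeds

def cost(layout):
    # Pipeline latency ~ slowest stage; TP adds communication overhead (toy model).
    stage_times = [layers / (GPU_SPEED[s] * tp) + 0.1 * tp
                   for s, (layers, tp) in enumerate(layout)]
    return max(stage_times)

def random_layout():
    cuts = sorted(random.sample(range(1, N_LAYERS), STAGES - 1))
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [N_LAYERS])]
    return [(sz, random.choice([1, 2, 4])) for sz in sizes]

def mutate(layout):
    layout = list(layout)
    i = random.randrange(STAGES)
    layers, tp = layout[i]
    if random.random() < 0.5 and layers > 1:   # move a layer to a neighboring stage
        j = (i + 1) % STAGES
        layout[i] = (layers - 1, tp)
        layout[j] = (layout[j][0] + 1, layout[j][1])
    else:                                      # change this stage's TP degree
        layout[i] = (layers, random.choice([1, 2, 4]))
    return layout

best = random_layout()
for _ in range(2000):
    cand = mutate(best)
    if cost(cand) < cost(best):
        best = cand
print(best, cost(best))
```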
Fairness
LLM Alignment
Acronyms
LLM: Large Language Model
LoRA: Low-Rank Adaptation