Large Language Model (LLM)
I am actively maintaining this list.
LLM Training
Hybrid parallelism
Fault tolerance
LLM Inference
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]
UChicago & Microsoft & Stanford
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514) [arXiv]
Apple
SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]
CMU & PKU & CUHK
Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920) [Paper]
PKU
Skip-join multi-level feedback queue scheduling instead of first-come-first-served.
Proactive KV cache swapping.
Compared against Orca as the baseline.
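A rough sketch of the skip-join MLFQ idea (illustrative only; the number of queues, the quanta, and the runner interface are assumptions, not FastServe's exact design): a request enters at the level whose quantum covers its estimated prefill time and is demoted when it exhausts its quantum.

```python
# Minimal sketch of skip-join multi-level feedback queue (MLFQ) scheduling.
# Queue count, quanta, and the runner interface are illustrative assumptions.
from collections import deque

QUANTA = [0.05, 0.1, 0.2, 0.4, 0.8]        # per-level time quanta in seconds (assumed)
queues = [deque() for _ in QUANTA]         # index 0 = highest priority

def skip_join(request, est_first_token_time):
    """Skip-join: enter at the first level whose quantum covers the estimated prefill
    time, instead of always starting at the highest-priority queue."""
    for level, quantum in enumerate(QUANTA):
        if est_first_token_time <= quantum or level == len(QUANTA) - 1:
            queues[level].append(request)
            return level

def schedule_step(run_for_quantum):
    """Serve the highest-priority non-empty queue; demote requests that exhaust their quantum."""
    for level, q in enumerate(queues):
        if q:
            request = q.popleft()
            finished = run_for_quantum(request, QUANTA[level])  # run for at most one quantum
            if not finished:
                queues[min(level + 1, len(QUANTA) - 1)].append(request)
            return request
    return None
```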
Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]
Google
Outstanding Paper Award
Model partitioning; PaLM; TPUv4
LLM-based Applications
Teola: Towards End-to-End Optimization of LLM-based Applications (ASPLOS 2025) [arXiv]
CUHK
An orchestration framework for LLM-based applications: utilize task primitives as the basic units; represent each query’s workflow as a primitive-level dataflow graph.
Enable a larger design space for optimization, including graph optimization (i.e., parallelization and pipelining) and application-aware scheduling.
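A minimal sketch of the dataflow-graph idea, using hypothetical primitive names and Python's standard graphlib for topological ordering (not Teola's actual API): primitives whose dependencies are satisfied surface together, exposing parallelization opportunities.

```python
# Minimal sketch of a primitive-level dataflow graph for one query's workflow.
# Node names are illustrative, not Teola's primitives.
import graphlib  # stdlib topological sorter

# node -> set of predecessor primitives (data dependencies)
graph = {
    "embed_query":      set(),
    "retrieve_wiki":    {"embed_query"},
    "retrieve_private": {"embed_query"},
    "prefill_llm":      {"retrieve_wiki", "retrieve_private"},
    "decode_llm":       {"prefill_llm"},
}

ts = graphlib.TopologicalSorter(graph)
ts.prepare()
while ts.is_active():
    ready = list(ts.get_ready())   # independent primitives: candidates for parallel execution
    print("run in parallel:", ready)
    ts.done(*ready)
```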
SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024) [Personal Notes] [Paper] [arXiv] [Code]
UC Berkeley & Stanford
Co-design the front-end programming interface and back-end serving runtime
SGLang; SGVM w/ RadixAttention
Reuse KV cache across multiple calls and programs
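A minimal sketch of prefix-based KV reuse in the spirit of RadixAttention, using a plain token trie rather than a true radix tree (data structures and names are illustrative, not SGLang's implementation).

```python
# Minimal sketch: map token prefixes to cached KV handles and reuse the longest match.
class TrieNode:
    def __init__(self):
        self.children = {}       # token id -> TrieNode
        self.kv_handle = None    # handle to the KV cache for this prefix, if cached

root = TrieNode()

def insert(tokens, kv_handle):
    node = root
    for t in tokens:
        node = node.children.setdefault(t, TrieNode())
    node.kv_handle = kv_handle

def longest_cached_prefix(tokens):
    """Return (matched_length, kv_handle) for the longest cached prefix of `tokens`."""
    node, best = root, (0, None)
    for i, t in enumerate(tokens):
        if t not in node.children:
            break
        node = node.children[t]
        if node.kv_handle is not None:
            best = (i + 1, node.kv_handle)
    return best

insert([1, 2, 3], kv_handle="kv:sys_prompt")
print(longest_cached_prefix([1, 2, 3, 4, 5]))  # -> (3, 'kv:sys_prompt'): prefill only tokens 4..5
```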
Retrieval-Augmented Generation (RAG)
CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation (arXiv:2502.11101) [arXiv]
Jeonbuk National University & Seoul National University
Leverage query-independent, offline caching to reuse a context KV cache store.
Cache Re-Positioning: shift keys to different positions in the encoding space.
Layer-Adaptive Cache Pruning: discard low-relevance caches for documents during pre-filling.
Adaptive Positional Allocation: adjust cache positions to maximize the use of the available positional encoding range.
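As a hedged illustration of what "shifting keys to different positions in the encoding space" can look like under rotary position embeddings (a generic RoPE re-rotation sketch, not CacheFocus's actual method): a cached key can be moved to a new position by rotating it by the position delta.

```python
# Minimal sketch of re-positioning a cached key under RoPE (illustrative only).
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position embedding at scalar position `pos` to an even-dim vector."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def reposition_key(cached_key, old_pos, new_pos):
    """Shift a cached key from `old_pos` to `new_pos` by rotating by the position delta."""
    return rope_rotate(cached_key, new_pos - old_pos)

k = np.random.randn(64)
k_cached = rope_rotate(k, pos=10)             # key cached at position 10
k_moved = reposition_key(k_cached, 10, 42)    # reused at position 42 without recomputation
assert np.allclose(k_moved, rope_rotate(k, pos=42), atol=1e-6)
```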
Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (SIGMOD 2025) [arXiv]
Adobe Research & IIT Bombay & IIT Kanpur
Identify the reusability of chunk-caches; perform a small fraction of recomputation to fix the cache and maintain output quality; store and evict chunk-caches.
A wrapper around vLLM; built on the xFormers backend optimized with Triton.
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (arXiv:2404.12457) [arXiv]
PKU & ByteDance
Organize the intermediate states of retrieved knowledge in a knowledge tree; cache them in the GPU and host memory.
Replacement policy: evaluate each node based on its access frequency, size, and access cost.
Priority = Clock + (Frequency × Cost) / Size
Nodes with lower priority are evicted first.
Built on vLLM.
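A minimal sketch of the priority-based replacement policy above (data structures are illustrative, not RAGCache's implementation).

```python
# Minimal sketch of clock-based priority eviction: Priority = Clock + (Frequency × Cost) / Size.
class Node:
    def __init__(self, key, size, cost):
        self.key, self.size, self.cost = key, size, cost
        self.frequency = 0

    def priority(self):
        return clock + self.frequency * self.cost / self.size

clock = 0.0   # global clock, advanced to the evicted node's priority
cache = {}    # key -> Node (each node holds the cached KV of one knowledge-tree entry)

def access(key, size=1.0, cost=1.0):
    node = cache.setdefault(key, Node(key, size, cost))
    node.frequency += 1
    return node

def evict_one():
    """Evict the lowest-priority node first and advance the clock, so recently useful
    nodes accumulate an advantage over cold ones."""
    global clock
    victim = min(cache.values(), key=lambda n: n.priority())
    clock = victim.priority()
    del cache[victim.key]
    return victim.key
```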
Request Scheduling
Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]
Seoul National University & FriendliAI
Iteration-level scheduling; selective batching.
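A minimal sketch of iteration-level scheduling with continuous batching (the engine and request interfaces are hypothetical, not Orca's API): the running batch is re-formed at every model iteration rather than once per request.

```python
# Minimal sketch of iteration-level (continuous) batching.
# `engine.step` and `request.finished` are assumed interfaces, not Orca's actual API.
from collections import deque

def serve(engine, pool: deque, max_batch: int = 8):
    """Re-form the running batch at every iteration instead of per request."""
    running = []
    while running or pool:
        # Admit new requests up to the batch limit at each iteration boundary.
        while pool and len(running) < max_batch:
            running.append(pool.popleft())
        engine.step(running)                                  # one forward pass: one token per request
        running = [r for r in running if not r.finished]      # retire completed requests immediately
```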
KV Cache Management
Prefill-Decode (PD) Disaggregation
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (FAST 2025) [Paper] [arXiv] [Slides] [Code]
Moonshot AI & Tsinghua
Best Paper Award
Separate the prefill and decoding clusters; prediction-based early rejection.
Distributed multi-layer KVCache pool; prefix-hashed KVCache object storage.
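A minimal sketch of prefix-hashed KVCache object keys (block size and hashing scheme are assumptions, not Mooncake's exact format): chaining each block's hash with its predecessor's makes requests that share a prefix map to the same leading block keys.

```python
# Minimal sketch of prefix-hashed KV block keys for a distributed KVCache pool.
import hashlib

BLOCK = 16  # tokens per KVCache block (assumed)

def prefix_block_hashes(token_ids):
    """Key each KV block by the hash of its tokens chained with the previous block's key."""
    keys, prev = [], b""
    for i in range(0, len(token_ids) // BLOCK * BLOCK, BLOCK):
        chunk = ",".join(map(str, token_ids[i:i + BLOCK])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        keys.append(prev.hex())
    return keys

a = prefix_block_hashes(list(range(48)))
b = prefix_block_hashes(list(range(32)) + [999] * 16)
print(a[:2] == b[:2])  # True: the shared 32-token prefix yields identical block keys
```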
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]
ICT, CAS & Huawei Cloud
Chunked Prefill
Serverless Inference
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference (arXiv:2502.09922) [arXiv]
CUHK-SZ & UVA & HKUST & Alibaba & Nokia Bell Labs
LoRA Serving
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]
HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]
UC Berkeley
Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]
UW & Duke
Position-Independent Caching (PIC)
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (ICML 2025) [arXiv]
PKU & NJU & Huawei Cloud
Key insight: because each chunk is encoded separately, its initial tokens absorb a disproportionate amount of attention, preventing subsequent tokens from attending to relevant parts.
Propose an algorithm named LegoLink that recomputes the k (≤ 32) initial tokens of each chunk (except the first), so these tokens recognize their non-initial status and lose their attention-absorbing ability.
Compared to CacheBlend, LegoLink reduces recomputation complexity and relies on static attention sparsity.
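A minimal sketch of which positions a LegoLink-style scheme would recompute, assuming a simple chunk-length layout (illustrative only; k and the chunking are placeholders).

```python
# Minimal sketch: recompute the first k tokens of every chunk except the first;
# all other positions reuse their position-independent KV cache.
def legolink_recompute_indices(chunk_lengths, k=32):
    """Return global token indices to recompute across concatenated chunks."""
    indices, offset = [], 0
    for i, length in enumerate(chunk_lengths):
        if i > 0:
            indices.extend(range(offset, offset + min(k, length)))
        offset += length
    return indices

print(legolink_recompute_indices([100, 80, 120], k=4))
# -> [100, 101, 102, 103, 180, 181, 182, 183]
```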
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (arXiv:2405.16444) [arXiv] [Code]
UChicago
For an LLM input including multiple text chunks, reuse all KV caches but re-compute a small fraction of KV.
Objective: have both the speed of full KV reuse and the generation quality of full KV recompute.
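A minimal sketch of selective recomputation in this spirit (the deviation metric and recompute ratio are illustrative assumptions, not CacheBlend's exact algorithm): recompute only the tokens whose reused KV deviates most from freshly computed values on an early layer.

```python
# Minimal sketch: pick a small fraction of tokens to recompute by per-token KV deviation.
import numpy as np

def select_tokens_to_recompute(kv_reused, kv_fresh_early_layer, ratio=0.15):
    """Return indices of the tokens with the largest deviation between the reused KV
    and the freshly computed KV on an early layer; reuse the cache for the rest."""
    deviation = np.linalg.norm(kv_reused - kv_fresh_early_layer, axis=-1)
    k = max(1, int(ratio * len(deviation)))
    return np.argsort(deviation)[-k:]

tokens = 200
reused = np.random.randn(tokens, 128)
fresh = reused + 0.01 * np.random.randn(tokens, 128)
fresh[[5, 42, 77]] += 1.0  # these tokens' cross-chunk attention changed the most
print(sorted(select_tokens_to_recompute(reused, fresh, ratio=0.015)))  # -> [5, 42, 77]
```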
Sparsity
Speculative Decoding
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]
CMU
Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]
UC Berkeley & ICSI & LBNL
Offloading
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]
Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
High-throughput serving; only use a single GPU.
Heterogeneous Environment
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs (arXiv:2502.00722) [arXiv]
Cambridge & HKUST & PKU & ETH & Purdue
HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]
HKUST & ETH & CMU
Support asymmetric tensor model parallelism and pipeline parallelism in heterogeneous settings (i.e., each pipeline stage can be assigned a different number of layers and a different tensor model parallel degree).
Propose a heuristic-based evolutionary algorithm to search for the optimal layout.
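A minimal sketch of such a heuristic evolutionary search (the cost model and mutation rules are placeholders, not HexGen's actual algorithm): candidates assign a layer count and tensor-parallel degree per pipeline stage, and the fittest layouts are kept and mutated.

```python
# Minimal sketch of evolutionary search over asymmetric parallel layouts.
# The cost model and mutation rules are placeholders for illustration.
import random

TOTAL_LAYERS, STAGES, TP_CHOICES = 32, 4, [1, 2, 4]

def random_layout():
    cuts = sorted(random.sample(range(1, TOTAL_LAYERS), STAGES - 1))
    layers = [b - a for a, b in zip([0] + cuts, cuts + [TOTAL_LAYERS])]
    return [(n, random.choice(TP_CHOICES)) for n in layers]   # (layers, tp degree) per stage

def cost(layout):
    # Placeholder cost: pipeline throughput is bounded by the slowest stage.
    return max(n / tp for n, tp in layout)

def mutate(layout):
    child = list(layout)
    i = random.randrange(STAGES)
    child[i] = (child[i][0], random.choice(TP_CHOICES))       # re-assign one stage's TP degree
    return child

population = [random_layout() for _ in range(16)]
for _ in range(50):  # keep the fittest half, refill with mutated offspring
    population.sort(key=cost)
    population = population[:8] + [mutate(random.choice(population[:8])) for _ in range(8)]
print(population[0], cost(population[0]))
```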
Fairness
Locality-aware Fair Scheduling in LLM Serving (arXiv:2501.14312) [arXiv]
UC Berkeley
LLM Alignment
Acronyms
LLM: Large Language Model
LoRA: Low-Rank Adaptation