# Large Language Model (LLM)

{% hint style="info" %}
I am actively maintaining this list.
{% endhint %}

## LLM Training

### Hybrid parallelism

* Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism ([ATC 2024](/reading-notes/conference/atc-2024.md)) \[[Paper](https://www.usenix.org/conference/atc24/presentation/yuan)] \[[Code](https://github.com/kwai/Megatron-Kwai/tree/atc24ae/examples/atc24)]
  * Kuaishou
* Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning ([OSDI 2022](/reading-notes/conference/osdi-2022.md)) \[[Paper](https://www.usenix.org/conference/osdi22/presentation/zheng-lianmin)] \[[Code](https://github.com/alpa-projects/alpa)] \[[Docs](https://alpa.ai/)]
  * UC Berkeley & AWS & Google & SJTU & CMU & Duke
  * Generalize the search over a unified space of inter- and intra-operator *parallelism strategies*.

### Fault tolerance

* Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates ([SOSP 2023](/reading-notes/conference/sosp-2023.md)) \[[Paper](https://dl.acm.org/doi/abs/10.1145/3600006.3613152)] \[[arXiv](https://browse.arxiv.org/abs/2309.08125)] \[[Code](https://github.com/SymbioticLab/Oobleck)]
  * UMich SymbioticLab & AWS & PKU
* Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints ([SOSP 2023](/reading-notes/conference/sosp-2023.md)) \[[Paper](https://dl.acm.org/doi/10.1145/3600006.3613145)]
  * Rice & AWS
* Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs ([NSDI 2023](/reading-notes/conference/nsdi-2023.md)) \[[Paper](https://www.usenix.org/conference/nsdi23/presentation/thorpe)] \[[Code](https://github.com/uclasystem/bamboo)]
  * UCLA & CMU & MSR & Princeton
  * Resilient distributed training on preemptible (spot) instances via redundant computation.

## LLM Inference

* CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving ([SIGCOMM 2024](/reading-notes/conference/sigcomm-2024.md)) \[[arXiv](https://arxiv.org/abs/2310.07240)] \[[Code](https://github.com/UChi-JCL/CacheGen)] \[[Video](https://www.youtube.com/watch?v=H4_OUWvdiNo)]
  * UChicago & Microsoft & Stanford
* Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization ([ISCA 2024](/reading-notes/conference/isca-2024.md))
* ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching ([ISCA 2024](/reading-notes/conference/isca-2024.md))
* LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv 2312.11514) \[[arXiv](https://arxiv.org/abs/2312.11514)]
  * Apple
* SpotServe: Serving Generative Large Language Models on Preemptible Instances ([ASPLOS 2024](/reading-notes/conference/asplos-2024.md)) \[[Personal Notes](/reading-notes/conference/asplos-2024/spotserve.md)] \[[arXiv](https://arxiv.org/abs/2311.15566)] \[[Code](https://github.com/Hsword/SpotServe)]
  * CMU & PKU & CUHK
* Fast Distributed Inference Serving for Large Language Models (arXiv 2305.05920) \[[Paper](https://arxiv.org/abs/2305.05920)]
  * PKU
  * Skip-join multi-level feedback queue (MLFQ) scheduling instead of first-come-first-served (see the sketch after this list).
  * Proactive KV cache swapping.
  * Evaluated against Orca.
* AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving ([OSDI 2023](/reading-notes/conference/osdi-2023.md)) \[[Paper](https://arxiv.org/abs/2302.11665)] \[[Code](https://github.com/alpa-projects/mms)]
  * UC Berkeley & PKU & UPenn & Stanford & Google
  * Trade-off between *the overhead of model parallelism* and *reduced serving latency by statistical multiplexing*.
* Efficiently Scaling Transformer Inference ([MLSys 2023](/reading-notes/conference/mlsys-2023.md)) \[[Paper](https://proceedings.mlsys.org/paper_files/paper/2023/hash/523f87e9d08e6071a3bbd150e6da40fb-Abstract-mlsys2023.html)]
  * Google
  * **Outstanding Paper Award**
  * Model partitioning; PaLM; TPUv4
* DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale ([SC 2022](/reading-notes/conference/sc-2022.md)) \[[Paper](https://dl.acm.org/doi/abs/10.5555/3571885.3571946)] \[[Code](https://github.com/microsoft/DeepSpeed)] \[[Homepage](https://www.deepspeed.ai/inference/)]
  * Microsoft DeepSpeed
  * Leverage CPU/NVMe/GPU memory.
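
A minimal sketch of FastServe's skip-join MLFQ idea, referenced above. The quantum values, the `Request` fields, and the demotion rule are illustrative assumptions, not the paper's actual scheduler.

```python
from collections import deque

# Illustrative per-level time slices (e.g., in decoding iterations).
QUANTA = [1, 2, 4, 8]

class Request:
    def __init__(self, rid, first_iter_time):
        self.rid = rid
        self.first_iter_time = first_iter_time  # profiled prefill cost
        self.level = 0

class SkipJoinMLFQ:
    def __init__(self):
        self.queues = [deque() for _ in QUANTA]

    def admit(self, req):
        # Skip-join: instead of always entering the highest-priority
        # queue, a request joins the first level whose quantum covers its
        # input-length-dependent first iteration time.
        req.level = next((i for i, q in enumerate(QUANTA)
                          if req.first_iter_time <= q), len(QUANTA) - 1)
        self.queues[req.level].append(req)

    def schedule(self):
        # Run the head of the highest-priority non-empty queue.
        for q in self.queues:
            if q:
                return q.popleft()
        return None

    def demote(self, req):
        # A request that exhausts its quantum without finishing moves down.
        req.level = min(req.level + 1, len(QUANTA) - 1)
        self.queues[req.level].append(req)

sched = SkipJoinMLFQ()
sched.admit(Request("long-prompt", first_iter_time=3))   # skips to level 2
sched.admit(Request("short-prompt", first_iter_time=1))  # enters level 0
print(sched.schedule().rid)  # "short-prompt" runs first
```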

### LLM-based Applications

* Teola: Towards End-to-End Optimization of LLM-based Applications ([ASPLOS 2025](/reading-notes/conference/asplos-2025.md)) \[[arXiv](https://arxiv.org/abs/2407.00326)]
  * CUHK
  * An orchestration framework for LLM-based applications: utilize task primitives as the basic units; represent each query’s workflow as a primitive-level dataflow graph.
  * Enable larger design space for optimization including graph optimization (i.e., parallelization and pipelining) and application-aware scheduling.
* Parrot: Efficient Serving of LLM-based Applications with Semantic Variable ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lin-chaofan)] \[[Code](https://github.com/microsoft/ParrotServe)]
  * SJTU & MSRA
* SGLang: Efficient Execution of Structured Language Model Programs ([NeurIPS 2024](/reading-notes/conference/neurips-2024.md)) \[[Personal Notes](/reading-notes/miscellaneous/arxiv/2024/sglang.md)] \[[Paper](https://openreview.net/forum?id=VqkAKQibpq)] \[[arXiv](https://arxiv.org/abs/2312.07104)] \[[Code](https://github.com/sgl-project/sglang)]
  * UC Berkeley & Stanford
  * Co-design the front-end programming interface and back-end serving runtime
  * **SGLang**; SGVM w/ **RadixAttention**
  * Reuse KV cache across multiple calls and programs
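
A toy sketch of prefix-based KV cache reuse in the spirit of RadixAttention. SGLang's actual runtime maintains a radix tree over token IDs with LRU eviction in GPU memory; the dict-based trie and string placeholders below are simplifications.

```python
# Toy prefix tree for KV cache reuse across calls and programs.
class Node:
    def __init__(self):
        self.children = {}  # token id -> Node
        self.kv = None      # placeholder for this token's cached KV entry

class PrefixCache:
    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens):
        """Record KV entries along the path (compute where absent)."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())
            if node.kv is None:
                node.kv = f"kv({t})"  # stands in for a real KV tensor

cache = PrefixCache()
cache.insert([1, 2, 3, 4])              # first program call fills the cache
print(cache.match_prefix([1, 2, 3, 9]))  # 3 -> only token 9 needs prefill
```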

### Retrieval-Augmented Generation (RAG)

* CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation (arXiv:2502.11101) \[[arXiv](https://arxiv.org/abs/2502.11101)]
  * Jeonbuk National University & Seoul National University
  * Leverage query-independent, offline caching to reuse a context KV cache store.
  * *Cache Re-Positioning*: shift keys to different positions in the encoding space.
  * *Layer-Adaptive Cache Pruning*: discard low-relevance caches for documents during pre-filling.
  * *Adaptive Positional Allocation*: adjust cache positions to maximize the use of the available positional encoding range.
* Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (SIGMOD 2025) \[[arXiv](https://arxiv.org/abs/2502.15734)]
  * Adobe Research & IIT Bombay & IIT Kanpur
  * Identify the reusability of chunk-caches; perform a small fraction of recomputation to fix the cache to maintain output quality; store and evict chunk-caches.
  * A wrapper around vLLM; built on Xformers backend optimized with Triton.
* RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (arXiv:2404.12457) \[[arXiv](https://arxiv.org/abs/2404.12457)]
  * PKU & ByteDance
  * Organize the intermediate states of retrieved knowledge in a knowledge tree; cache them in the GPU and host memory.
  * Replacement policy: evaluate each node based on its access frequency, size, and access cost.
    * Priority = Clock + (Frequency × Cost) / Size
    * Nodes with lower priority are evicted first.
  * Built on vLLM.
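
A minimal sketch of the eviction priority above, assuming GDSF-style aging in which the clock advances to each evicted node's priority so stale entries age out. The node fields are simplified placeholders, not RAGCache's actual bookkeeping.

```python
# Priority = clock + (frequency * cost) / size; lowest priority evicted.
class CacheNode:
    def __init__(self, name, size, cost):
        self.name, self.size, self.cost = name, size, cost
        self.frequency = 0
        self.priority = 0.0

class PriorityEvictor:
    def __init__(self):
        self.clock = 0.0
        self.nodes = []

    def access(self, node):
        node.frequency += 1
        node.priority = self.clock + node.frequency * node.cost / node.size

    def evict(self):
        victim = min(self.nodes, key=lambda n: n.priority)
        self.clock = victim.priority  # age the cache, GDSF-style
        self.nodes.remove(victim)
        return victim

ev = PriorityEvictor()
a = CacheNode("doc-a", size=4, cost=8)
b = CacheNode("doc-b", size=4, cost=4)
ev.nodes = [a, b]
ev.access(a); ev.access(a); ev.access(b)  # priorities: a=4.0, b=1.0
print(ev.evict().name)  # "doc-b": lowest frequency*cost/size
```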

### Request Scheduling

* Llumnix: Dynamic Scheduling for Large Language Model Serving ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sun-biao)] \[[Code](https://github.com/AlibabaPAI/llumnix)]
  * Alibaba
* Orca: A Distributed Serving System for Transformer-Based Generative Models ([OSDI 2022](/reading-notes/conference/osdi-2022.md)) \[[Personal Notes](/reading-notes/conference/osdi-2022/orca.md)] \[[Paper](https://www.usenix.org/conference/osdi22/presentation/yu)]
  * Seoul National University & FriendliAI
  * Iteration-level scheduling; selective batching.

### KV Cache Management

* Efficient Memory Management for Large Language Model Serving with PagedAttention ([SOSP 2023](/reading-notes/conference/sosp-2023.md)) \[[Paper](https://dl.acm.org/doi/10.1145/3600006.3613165)] \[[arXiv](https://browse.arxiv.org/abs/2309.06180)] \[[Code](https://github.com/vllm-project/vllm)] \[[Homepage](https://vllm.ai/)]
  * UC Berkeley & Stanford & UCSD
  * vLLM, PagedAttention
  * Partition the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens
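
A sketch of the block-table bookkeeping behind PagedAttention. Real vLLM manages GPU tensor blocks and supports copy-on-write sharing across sequences; the block size and free-list allocator here are illustrative.

```python
BLOCK_SIZE = 16  # tokens per KV block

class BlockManager:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # block boundary: allocate a new block
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def locate(self, seq_id, pos):
        """Map a logical token position to (physical block, offset)."""
        block = self.tables[seq_id][pos // BLOCK_SIZE]
        return block, pos % BLOCK_SIZE

    def release(self, seq_id):
        # Blocks return to the free list; no contiguous region is needed.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

mgr = BlockManager(num_blocks=8)
for _ in range(20):               # 20 tokens -> 2 blocks
    mgr.append_token("seq0")
print(mgr.locate("seq0", 17))     # second block, offset 1
```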

### Prefill-Decode (PD) Disaggregation

* Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter (arXiv:2604.15039) \[[arXiv](https://arxiv.org/abs/2604.15039)]
  * Moonshot AI & THU
  * Introduce **PrfaaS**, a cross-datacenter serving architecture that selectively offloads long-context prefills to standalone compute-dense clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode.
  * Combine model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement instead of fully externalizing all prefill requests.
  * Remove the requirement that heterogeneous accelerators share a single low-latency RDMA fabric; on an internal 1T-parameter hybrid model, improve serving throughput by 54% over homogeneous PD and 32% over a naive heterogeneous baseline.
* Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (FAST 2025) \[[Paper](https://www.usenix.org/conference/fast25/presentation/qin)] \[[arXiv](https://arxiv.org/abs/2407.00079)] \[[Slides](https://www.usenix.org/system/files/fast25_slides-qin.pdf)] \[[Code](https://github.com/kvcache-ai/Mooncake)]
  * Moonshot AI & Tsinghua
  * **Best Paper Award**
  * Separate the prefill and decoding clusters; prediction-based early rejection.
  * Distributed multi-layer KVCache pool; prefix-hashed KVCache object storage (see the sketch after this list).
* Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) \[[arXiv](https://arxiv.org/abs/2401.11181)]
  * ICT, CAS & Huawei Cloud
* DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/zhong-yinmin)] \[[Code](https://github.com/LLMServe/DistServe)]
  * PKU & UCSD
* Splitwise: Efficient Generative LLM Inference Using Phase Splitting ([ISCA 2024](/reading-notes/conference/isca-2024.md)) \[[Paper](https://ieeexplore.ieee.org/document/10609649/)] \[[arXiv](https://arxiv.org/abs/2311.18677)] \[[Blog](https://www.microsoft.com/en-us/research/blog/splitwise-improves-gpu-usage-by-splitting-llm-inference-phases/)]
  * UW & Microsoft
  * **Best Paper Award**
  * Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines
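
A sketch of prefix-hashed KVCache object keys in the spirit of Mooncake's storage layer (referenced above): each fixed-size token chunk is keyed by a hash chained over all preceding chunks, so requests sharing a prompt prefix map to the same cache objects. The chunk size and hash choice are assumptions.

```python
import hashlib

CHUNK = 4  # tokens per KVCache object (illustrative)

def prefix_keys(tokens):
    keys, parent = [], b""
    # Only full chunks are cacheable; the key of each chunk depends on
    # its own tokens and, through chaining, on every preceding chunk.
    for i in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        chunk = tokens[i:i + CHUNK]
        h = hashlib.sha256(parent + str(chunk).encode()).hexdigest()
        keys.append(h)
        parent = h.encode()
    return keys  # look each key up in the distributed KVCache pool

a = prefix_keys([1, 2, 3, 4, 5, 6, 7, 8, 9])
b = prefix_keys([1, 2, 3, 4, 9, 9, 9, 9])
print(a[0] == b[0], a[1] == b[1])  # True False: only the shared prefix hits
```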

### Chunked Prefill

* Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/agrawal)] \[[Code](https://github.com/microsoft/sarathi-serve)] \[[arXiv](https://arxiv.org/abs/2403.02310)]
  * MSR India & GaTech
  * **Sarathi-Serve**

### Serverless Inference

* λScale: Enabling Fast Scaling for Serverless Large Language Model Inference (arXiv:2502.09922) \[[arXiv](https://arxiv.org/abs/2502.09922)]
  * CUHK-SZ & UVA & HKUST & Alibaba & Nokia Bell Labs
* ServerlessLLM: Low-Latency Serverless Inference for Large Language Models ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/fu)] \[[Code](https://github.com/ServerlessLLM/ServerlessLLM)] \[[arXiv](https://arxiv.org/abs/2401.14351)]
  * Edinburgh

### LoRA Serving

* dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang)]
  * PKU & Shanghai AI Lab
* CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) \[[arXiv](https://arxiv.org/abs/2401.11240)]
  * HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
* S-LoRA: Serving Thousands of Concurrent LoRA Adapters ([MLSys 2024](/reading-notes/conference/mlsys-2024.md)) \[[arXiv](https://arxiv.org/abs/2311.03285)] \[[Code](https://github.com/S-LoRA/S-LoRA)]
  * UC Berkeley
* Punica: Multi-Tenant LoRA Serving ([MLSys 2024](/reading-notes/conference/mlsys-2024.md)) \[[arXiv](https://arxiv.org/abs/2310.18547)] \[[Code](https://github.com/punica-ai/punica)]
  * UW & Duke

### Position-Independent Caching (PIC)

* EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (ICML 2025) \[[arXiv](https://arxiv.org/abs/2410.15332)]
  * PKU & NJU & Huawei Cloud
  * Key insight: within each chunk, the initial tokens absorb a disproportionate amount of attention (an attention-sink effect), preventing subsequent tokens from attending to relevant parts.
  * Propose an algorithm named *LegoLink*, which recomputes the first k (≤ 32) tokens of each chunk (except the first) so that these tokens recognize their non-initial position and lose their attention-absorbing ability.
  * Compared to CacheBlend, LegoLink reduces recomputation complexity and relies on static attention sparsity.
* CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (arXiv:2405.16444) \[[arXiv](https://arxiv.org/abs/2405.16444)] \[[Code](https://github.com/YaoJiayi/CacheBlend)]
  * UChicago
  * For an LLM input including multiple text chunks, reuse all KV caches but re-compute a small fraction of KV.
  * Objective: have both the speed of full KV reuse and the generation quality of full KV recompute.
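
A sketch of CacheBlend-style selective recomputation: reuse all per-chunk KV caches, then recompute only the token positions whose cached KV deviates most from a full prefill. The deviation scores and recompute budget below are illustrative stand-ins; CacheBlend estimates deviations layer by layer during fusion.

```python
def tokens_to_recompute(deviation, budget_ratio=0.25):
    """Pick the token positions with the largest KV deviation."""
    k = max(1, int(len(deviation) * budget_ratio))
    ranked = sorted(range(len(deviation)), key=lambda i: -deviation[i])
    return sorted(ranked[:k])

# One (assumed) deviation score per token position across the
# concatenated chunks; only the top fraction gets re-prefetched.
scores = [0.9, 0.1, 0.05, 0.7, 0.02, 0.3, 0.01, 0.6]
print(tokens_to_recompute(scores))  # [0, 3]: ~25% of tokens recomputed
```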

### Sparsity

* InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/lee)]
  * Seoul National University
* PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU ([SOSP 2024](/reading-notes/conference/sosp-2024.md)) \[[Paper](https://dl.acm.org/doi/10.1145/3694715.3695964)] \[[arXiv](https://arxiv.org/abs/2312.12456)] \[[Code](https://github.com/SJTU-IPADS/PowerInfer)]
  * SJTU
  * A GPU-CPU hybrid inference engine
  * Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU (see the sketch after this list)
* Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time ([ICML 2023](/reading-notes/conference/icml-2023.md)) \[[Paper](https://proceedings.mlr.press/v202/liu23am.html)] \[[Code](https://github.com/FMInference/DejaVu)]
  * Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU
  * A system to predict *contextual sparsity* (small, input-dependent sets of attention heads and MLP parameters that yield *approximately* the same output).
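
A sketch of PowerInfer's hot/cold neuron placement: neurons with the highest offline-profiled activation frequency are preloaded onto the GPU, while the long tail is computed on the CPU. The budget and routing below are placeholders, not PowerInfer's actual kernels.

```python
def place_neurons(activation_freq, gpu_budget):
    """Return (gpu_set, cpu_set) by descending activation frequency."""
    order = sorted(activation_freq, key=activation_freq.get, reverse=True)
    return set(order[:gpu_budget]), set(order[gpu_budget:])

def forward(active_neurons, gpu_set):
    # In the real system the GPU computes the hot activations in a dense
    # kernel while the CPU handles the few cold ones, avoiding transfers.
    gpu_work = [n for n in active_neurons if n in gpu_set]
    cpu_work = [n for n in active_neurons if n not in gpu_set]
    return gpu_work, cpu_work

freq = {"n0": 0.9, "n1": 0.8, "n2": 0.05, "n3": 0.01}
gpu, cpu = place_neurons(freq, gpu_budget=2)
print(forward({"n0", "n2"}, gpu))  # n0 runs on GPU, n2 on CPU
```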

### Speculative Decoding

* Online Speculative Decoding ([ICML 2024](/reading-notes/conference/icml-2024.md)) \[[arXiv](https://arxiv.org/abs/2310.07177)]
  * UC Berkeley & UCSD & Sisu Data & SJTU
* SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification ([ASPLOS 2024](/reading-notes/conference/asplos-2024.md)) \[[arXiv](https://arxiv.org/abs/2305.09781)] \[[Code](https://github.com/flexflow/FlexFlow/tree/inference)]
  * CMU
* Speculative Decoding with Big Little Decoder (NeurIPS 2023) \[[Paper](https://arxiv.org/abs/2302.07863)]
  * UC Berkeley & ICSI & LBNL
* Fast Inference from Transformers via Speculative Decoding ([ICML 2023](/reading-notes/conference/icml-2023.md)) \[[Paper](https://openreview.net/pdf?id=C9NEblP8vS)]
  * Google Research
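
A minimal draft-then-verify loop illustrating the common core of these papers. For brevity it uses greedy acceptance with stub models; the stochastic rule of Leviathan et al. instead accepts a draft token with probability min(1, p_target / p_draft) to preserve the target model's output distribution exactly.

```python
def draft_model(ctx, k):   # cheap model proposes k tokens (stub)
    return [(ctx[-1] + 1 + i) % 50 for i in range(k)]

def target_model(ctx):     # expensive model's next token (stub)
    return (ctx[-1] + 1) % 50

def speculative_decode(ctx, steps=8, k=4):
    out = list(ctx)
    while len(out) < len(ctx) + steps:
        proposal = draft_model(out, k)
        # One target-model pass scores all k positions in parallel;
        # here we emulate the verification token by token.
        for t in proposal:
            if target_model(out) == t:   # accept the matching draft token
                out.append(t)
            else:                        # reject: take the target's token
                out.append(target_model(out))
                break                    # and restart drafting from here
    return out

print(speculative_decode([0]))  # stubs agree, so all drafts are accepted
```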

### Offloading

* FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU ([ICML 2023](/reading-notes/conference/icml-2023.md)) \[[Personal Notes](/reading-notes/miscellaneous/arxiv/2023/flexgen.md)] \[[Paper](https://proceedings.mlr.press/v202/sheng23a.html)] \[[Code](https://github.com/FMInference/FlexGen)]
  * Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
  * *High-throughput serving; only use a single GPU.*

### Heterogeneous Environment

* Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs (arXiv:2502.00722) \[[arXiv](https://arxiv.org/abs/2502.00722)]
  * Cambridge & HKUST & PKU & ETH & Purdue
* HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment (ICLR 2025) \[[Paper](https://openreview.net/forum?id=Cs6MrbFuMq)] \[[arXiv](https://arxiv.org/abs/2502.07903)]
  * HKUST
* HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment ([ICML 2024](/reading-notes/conference/icml-2024.md)) \[[Personal Notes](/reading-notes/miscellaneous/arxiv/2023/hexgen.md)] \[[arXiv](https://arxiv.org/abs/2311.11514)] \[[Code](https://github.com/Relaxed-System-Lab/HexGen)]
  * HKUST & ETH & CMU
  * Support *asymmetric* tensor model parallelism and pipeline parallelism under the *heterogeneous* setting (i.e., each pipeline parallel stage can be assigned with a different number of layers and tensor model parallel degree)
  * Propose *a heuristic-based evolutionary algorithm* to search for the optimal layout
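
A sketch of the asymmetric layout HexGen searches over: each pipeline stage may hold a different number of layers at a different tensor-parallel degree, matched to its GPUs. The per-layer memory constant and feasibility check are made-up placeholders for illustration.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    layers: int        # transformer layers assigned to this stage
    tp_degree: int     # tensor-parallel width within the stage
    gpu_mem_gb: float  # memory of each GPU in this stage

LAYER_GB = 2.0  # assumed per-layer footprint at TP degree 1

def fits(stage):
    # Each TP shard holds 1/tp_degree of every layer assigned to it.
    return stage.layers * LAYER_GB / stage.tp_degree <= stage.gpu_mem_gb

# A heterogeneous plan: big GPUs take many layers at low TP degree;
# small GPUs take fewer layers but split each layer across more devices.
plan = [Stage(24, 2, 40.0), Stage(12, 4, 12.0), Stage(4, 8, 8.0)]
print(all(fits(s) for s in plan))  # True: a feasible asymmetric layout
```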

### Fairness

* Locality-aware Fair Scheduling in LLM Serving (arXiv:2501.14312) \[[arXiv](https://arxiv.org/abs/2501.14312)]
  * UC Berkeley
* Fairness in Serving Large Language Models ([OSDI 2024](/reading-notes/conference/osdi-2024.md)) \[[Paper](https://www.usenix.org/conference/osdi24/presentation/sheng)] \[[Code](https://github.com/Ying1123/VTC-artifact)]
  * UC Berkeley

## LLM Alignment

* PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch ([ATC 2024](/reading-notes/conference/atc-2024.md)) \[[Paper](https://www.usenix.org/conference/atc24/presentation/lei)]
  * THU

## Acronyms

* LLM: Large Language Model
* LoRA: Low-Rank Adaptation

