Large Language Model (LLM)

I am actively maintaining this list.

LLM Training

Hybrid parallelism

  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (ATC 2024) [Paper] [Code]

    • Kuaishou

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]

    • UC Berkeley & AWS & Google & SJTU & CMU & Duke

    • Generalize the search through parallelism strategies.

Fault tolerance

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]

    • UMich SymbioticLab & AWS & PKU

  • Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]

    • Rice & AWS

  • Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]

    • UCLA & CMU & MSR & Princeton

    • Resilient distributed training

LLM Inference

  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]

    • UChicago & Microsoft & Stanford

  • Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)

  • ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514) [arXiv]

    • Apple

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]

    • CMU & PKU & CUHK

  • Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920) [Paper]

    • PKU

    • Skip-join multi-level feedback queue scheduling instead of first-come-first-served; see the sketch after this list.

    • Proactive KV cache swapping.

    • Compared to Orca

  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]

    • UC Berkeley & PKU & UPenn & Stanford & Google

    • Trade off the overhead of model parallelism against the serving latency reduction from statistical multiplexing.

  • Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]

    • Google

    • Outstanding Paper Award

    • Model partitioning; PaLM; TPUv4

  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]

    • Microsoft DeepSpeed

    • Leverage CPU/NVMe/GPU memory.
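
A minimal sketch of the skip-join MLFQ idea described for Fast Distributed Inference Serving above, assuming made-up queue levels, quanta, and a tokens-per-quantum threshold: a new request "skips" to the queue level matching its prompt length instead of always entering the top queue, and preempted requests are demoted (where a real system would also swap KV caches proactively).

```python
# Illustrative skip-join multi-level feedback queue (MLFQ) scheduler.
# Queue levels, quanta, and the tokens-per-quantum threshold are assumptions.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    req_id: int
    prompt_len: int
    remaining_tokens: int  # decode tokens still to generate

class SkipJoinMLFQ:
    def __init__(self, num_levels=4, base_quantum=1):
        self.quanta = [base_quantum * (2 ** i) for i in range(num_levels)]
        self.queues = [deque() for _ in range(num_levels)]

    def _initial_level(self, req: Request) -> int:
        # "Skip-join": longer prompts (longer first iteration) start at a lower
        # priority level instead of always entering the top queue.
        for level, quantum in enumerate(self.quanta):
            if req.prompt_len <= quantum * 512:  # 512 tokens per quantum: assumed
                return level
        return len(self.quanta) - 1

    def admit(self, req: Request):
        self.queues[self._initial_level(req)].append(req)

    def step(self):
        # serve the highest-priority non-empty queue for one quantum
        for level, queue in enumerate(self.queues):
            if not queue:
                continue
            req = queue.popleft()
            served = min(self.quanta[level], req.remaining_tokens)
            req.remaining_tokens -= served
            if req.remaining_tokens > 0:
                # preempt and demote; a real system would also proactively
                # swap this request's KV cache out of GPU memory here
                self.queues[min(level + 1, len(self.queues) - 1)].append(req)
            return req.req_id, served
        return None

sched = SkipJoinMLFQ()
sched.admit(Request(0, prompt_len=100, remaining_tokens=3))
sched.admit(Request(1, prompt_len=4000, remaining_tokens=3))
while (step := sched.step()) is not None:
    print(step)
```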

LLM-based Applications

  • Teola: Towards End-to-End Optimization of LLM-based Applications (ASPLOS 2025) [arXiv]

    • CUHK

    • An orchestration framework for LLM-based applications: utilize task primitives as the basic units; represent each query’s workflow as a primitive-level dataflow graph.

    • Enable larger design space for optimization including graph optimization (i.e., parallelization and pipelining) and application-aware scheduling.

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]

    • SJTU & MSRA

  • SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024) [Personal Notes] [Paper] [arXiv] [Code]

    • UC Berkeley & Stanford

    • Co-design the front-end programming interface and back-end serving runtime

    • SGLang; SGVM w/ RadixAttention

    • Reuse KV cache across multiple calls and programs
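
A minimal sketch of the prefix-reuse idea behind RadixAttention, assuming a simplified token-keyed tree and placeholder KV handles; SGLang's actual radix tree, eviction policy, and cache layout differ.

```python
# Illustrative prefix tree for KV-cache reuse across calls (RadixAttention-style).
# The node structure and "kv_handle" placeholder are simplifying assumptions.
class PrefixNode:
    def __init__(self):
        self.children = {}      # token id -> PrefixNode
        self.kv_handle = None   # handle to the cached KV entry for this token

class PrefixTree:
    def __init__(self):
        self.root = PrefixNode()

    def match_prefix(self, tokens):
        """Return (number of matched tokens, their KV handles)."""
        node, handles = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            handles.append(node.kv_handle)
        return len(handles), handles

    def insert(self, tokens, kv_handles):
        node = self.root
        for tok, handle in zip(tokens, kv_handles):
            node = node.children.setdefault(tok, PrefixNode())
            node.kv_handle = handle

tree = PrefixTree()
system_prompt = [101, 7, 7, 9]  # toy token ids shared across calls
tree.insert(system_prompt, kv_handles=list(range(len(system_prompt))))

query = system_prompt + [42, 43]
matched, reused = tree.match_prefix(query)
print(f"reuse KV for {matched} tokens, prefill only {len(query) - matched}")
```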

Retrieval-Augmented Generation (RAG)

  • CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation (arXiv:2502.11101) [arXiv]

    • Jeonbuk National University & Seoul National University

    • Leverage query-independent, offline caching to reuse a context KV cache store.

    • Cache Re-Positioning: shift keys to different positions in the encoding space.

    • Layer-Adaptive Cache Pruning: discard low-relevance caches for documents during pre-filling.

    • Adaptive Positional Allocation: adjust cache positions to maximize the use of the available positional encoding range.

  • Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (SIGMOD 2025) [arXiv]

    • Adobe Research & IIT Bombay & IIT Kanpur

    • Identify the reusability of chunk-caches; perform a small fraction of recomputation to fix the cache to maintain output quality; store and evict chunk-caches.

    • A wrapper around vLLM; built on Xformers backend optimized with Triton.

  • RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (arXiv:2404.12457) [arXiv]

    • PKU & ByteDance

    • Organize the intermediate states of retrieved knowledge in a knowledge tree; cache them in the GPU and host memory.

    • Replacement policy: evaluate each node based on its access frequency, size, and access cost.

      • Priority = Clock + (Frequency × Cost) / Size

      • Nodes with lower priority are evicted first.

    • Built on vLLM.
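
A minimal sketch of eviction under the priority formula above, in the Greedy-Dual style; the node fields and the clock-advance rule are assumptions, not RAGCache's exact implementation.

```python
# Illustrative eviction by priority = clock + frequency * cost / size.
# Field names and the clock-update rule are assumptions for the sketch.
from dataclasses import dataclass

@dataclass
class CacheNode:
    doc_id: str
    size: float        # e.g., KV-cache bytes
    cost: float        # e.g., time to recompute this node's KV cache
    frequency: int = 1

def priority(node: CacheNode, clock: float) -> float:
    return clock + node.frequency * node.cost / node.size

def evict_one(nodes: list[CacheNode], clock: float) -> tuple[CacheNode, float]:
    """Evict the lowest-priority node and advance the clock to its priority,
    so long-resident but rarely used entries age out (Greedy-Dual style)."""
    victim = min(nodes, key=lambda n: priority(n, clock))
    nodes.remove(victim)
    return victim, priority(victim, clock)

nodes = [
    CacheNode("doc-a", size=4.0, cost=8.0, frequency=5),
    CacheNode("doc-b", size=8.0, cost=2.0, frequency=1),
]
victim, new_clock = evict_one(nodes, clock=0.0)
print(victim.doc_id, new_clock)  # doc-b has the lowest priority and is evicted
```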

Request Scheduling

  • Llumnix: Dynamic Scheduling for Large Language Model Serving (OSDI 2024) [Paper] [Code]

    • Alibaba

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]

    • Seoul National University & FriendliAI

    • Iteration-level scheduling; selective batching.
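
A minimal sketch of iteration-level (continuous) batching in the spirit of Orca, assuming a placeholder per-iteration forward pass: requests join and leave the running batch at every iteration boundary rather than at whole-batch granularity.

```python
# Illustrative iteration-level scheduling loop. Request fields, the admission
# rule, and the per-iteration "forward pass" stub are assumptions.
from collections import deque

class Req:
    def __init__(self, req_id, max_new_tokens):
        self.req_id = req_id
        self.remaining = max_new_tokens

def serve(waiting: deque, max_batch_size: int = 4):
    running = []
    while waiting or running:
        # admit new requests at every iteration boundary
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # run exactly one decode iteration for the whole batch
        for req in running:
            req.remaining -= 1  # stand-in for generating one token
        # retire finished requests immediately so their slots free up
        finished = [r for r in running if r.remaining == 0]
        running = [r for r in running if r.remaining > 0]
        for r in finished:
            print(f"request {r.req_id} finished")

serve(deque(Req(i, max_new_tokens=i + 1) for i in range(6)))
```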

KV Cache Management

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]

    • UC Berkeley & Stanford & UCSD

    • vLLM, PagedAttention

    • Partition the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens
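
A minimal sketch of the block-table bookkeeping behind PagedAttention, assuming a toy free-list allocator and a fixed block size; the actual vLLM implementation manages GPU tensors and copy-on-write block sharing.

```python
# Illustrative block table mapping a sequence's logical KV blocks to physical blocks.
# The block size and free-list allocator are assumptions for the sketch.
BLOCK_SIZE = 16  # tokens per KV block

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()

    def free(self, block: int):
        self.free_blocks.append(block)

class SequenceKV:
    """Per-sequence block table: logical block i -> physical block id."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []
        self.num_tokens = 0

    def append_token(self):
        # allocate a new physical block only when the last one is full, so
        # waste is bounded by one partially filled block per sequence
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

alloc = BlockAllocator(num_physical_blocks=64)
seq = SequenceKV(alloc)
for _ in range(40):
    seq.append_token()
print(seq.block_table)  # 3 physical blocks hold 40 tokens (block size 16)
```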

Prefill-Decode (PD) Disaggregation

  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (FAST 2025) [Paper] [arXiv] [Slides] [Code]

    • Moonshot AI & Tsinghua

    • Best Paper Award

    • Separate the prefill and decoding clusters; prediction-based early rejection.

    • Distributed multi-layer KVCache pool; prefix-hashed KVCache object storage.

  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]

    • ICT, CAS & Huawei Cloud

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 2024) [Paper] [Code]

    • PKU & UCSD

  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 2024) [Paper] [arXiv] [Blog]

    • UW & Microsoft

    • Best Paper Award

    • Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines
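
A toy sketch of the phase split, assuming placeholder workers and a dummy "model": the prefill worker produces the prompt's KV cache, which is handed off to a decode worker that generates tokens. Real systems transfer the KV cache over high-bandwidth interconnects and scale the two pools independently.

```python
# Toy sketch of prefill/decode disaggregation. The "model", KV representation,
# and hand-off are placeholders, not any real system's implementation.

def prefill_worker(prompt_tokens: list[int]) -> dict:
    # stand-in for the compute-bound prefill pass over the whole prompt
    kv_cache = {"tokens": list(prompt_tokens)}
    return kv_cache  # in practice: shipped to the decode machine

def decode_worker(kv_cache: dict, max_new_tokens: int) -> list[int]:
    # stand-in for the memory-bound, token-by-token decode phase
    out = []
    for _ in range(max_new_tokens):
        next_token = (sum(kv_cache["tokens"]) + len(out)) % 50257  # dummy "model"
        out.append(next_token)
        kv_cache["tokens"].append(next_token)
    return out

kv = prefill_worker([1, 2, 3, 4])
print(decode_worker(kv, max_new_tokens=4))
```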

Chunked Prefill

  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (OSDI 2024) [Paper] [Code] [arXiv]

    • MSR India & GaTech

    • Sarathi-Serve: split long prefills into fixed-size chunks and coalesce them with ongoing decodes into hybrid batches (stall-free batching) to tame the throughput-latency trade-off; see the sketch below.
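
A minimal sketch of hybrid-batch construction with chunked prefill, assuming an invented per-iteration token budget and request fields: each iteration packs all ongoing decodes plus at most one prefill chunk.

```python
# Illustrative hybrid batching with chunked prefill. The token budget and
# request fields are assumptions for the sketch.
from collections import deque

TOKEN_BUDGET = 512  # max tokens processed per iteration (assumed)

def build_batch(decodes: list[dict], prefills: deque) -> list[tuple[str, int]]:
    batch = [("decode", 1) for _ in decodes]        # one token per decode request
    budget = TOKEN_BUDGET - len(batch)
    if prefills and budget > 0:
        req = prefills[0]
        chunk = min(budget, req["prefill_left"])    # take only a chunk of the prompt
        req["prefill_left"] -= chunk
        batch.append(("prefill", chunk))
        if req["prefill_left"] == 0:
            decodes.append(prefills.popleft())       # prompt done, start decoding
    return batch

decodes = [{"prefill_left": 0} for _ in range(8)]
prefills = deque([{"prefill_left": 1200}])
for it in range(4):
    print(it, build_batch(decodes, prefills))
```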

Serverless Inference

  • λScale: Enabling Fast Scaling for Serverless Large Language Model Inference (arXiv:2502.09922) [arXiv]

    • CUHK-SZ & UVA & HKUST & Alibaba & Nokia Bell Labs

  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (OSDI 2024) [Paper] [Code] [arXiv]

    • Edinburgh

LoRA Serving

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (OSDI 2024) [Paper]

    • PKU & Shanghai AI Lab

  • CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]

    • HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]

    • UC Berkeley

  • Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]

    • UW & Duke

Position-Independent Caching (PIC)

  • EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (ICML 2025) [arXiv]

    • PKU & NJU & Huawei Cloud

    • Key insight: the initial tokens of each chunk separately absorb a disproportionate amount of attention, preventing subsequent tokens from attending to relevant parts.

    • Propose an algorithm named LegoLink that recomputes the k (≤ 32) initial tokens of each chunk (except the first chunk), so these tokens recognize their non-initial position and lose their attention-absorbing ability (see the sketch at the end of this list).

    • Compared to CacheBlend, LegoLink reduces recomputation complexity and relies on static attention sparsity.

  • CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (arXiv:2405.16444) [arXiv] [Code]

    • UChicago

    • For an LLM input including multiple text chunks, reuse all KV caches but re-compute a small fraction of KV.

    • Objective: have both the speed of full KV reuse and the generation quality of full KV recompute.
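
A minimal sketch of which token positions get recomputed when stitching independently cached chunks, following the LegoLink description above (recompute only the first k tokens of every chunk except the first); the chunk lengths and k are illustrative, and CacheBlend instead selects the small fraction of tokens to recompute adaptively.

```python
# Illustrative selection of token positions to recompute when concatenating
# independently pre-computed chunk KV caches. Chunk lengths and k are
# assumptions; real systems operate on the actual KV tensors.
def recompute_positions(chunk_lengths: list[int], k: int = 16) -> list[int]:
    positions, offset = [], 0
    for i, length in enumerate(chunk_lengths):
        if i > 0:  # the first chunk's cache is reused as-is
            positions.extend(range(offset, offset + min(k, length)))
        offset += length
    return positions

# Three retrieved chunks of 100, 80, and 120 tokens: only 2 * k positions are
# recomputed instead of a full 300-token prefill.
print(len(recompute_positions([100, 80, 120], k=16)))  # 32
```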

Sparsity

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]

    • Seoul National University

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (SOSP 2024) [Paper] [arXiv] [Code]

    • SJTU

    • A GPU-CPU hybrid inference engine

    • Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]

    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU

    • A system to predict contextual sparsity (small, input-dependent sets of attention heads and MLP neurons that yield approximately the same output as the dense model).
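
A minimal numpy sketch of contextual sparsity in a single MLP block, assuming toy shapes and a stand-in top-k "predictor": only a small, input-dependent subset of hidden neurons is computed. Deja Vu's real predictors are small trained networks run ahead of time.

```python
# Illustrative contextual sparsity for an MLP block: compute only a small,
# input-dependent subset of neurons. Shapes and the top-k predictor are
# assumptions for the sketch.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, k = 64, 256, 32  # keep ~12.5% of the hidden neurons

W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

def predict_active_neurons(x: np.ndarray) -> np.ndarray:
    # stand-in predictor: score neurons by |x @ W1| and keep the top-k
    scores = np.abs(x @ W1)
    return np.argpartition(scores, -k)[-k:]

def sparse_mlp(x: np.ndarray) -> np.ndarray:
    idx = predict_active_neurons(x)
    hidden = np.maximum(x @ W1[:, idx], 0.0)  # ReLU over selected neurons only
    return hidden @ W2[idx, :]

def dense_mlp(x: np.ndarray) -> np.ndarray:
    return np.maximum(x @ W1, 0.0) @ W2

x = rng.standard_normal(d_model)
err = np.linalg.norm(sparse_mlp(x) - dense_mlp(x)) / np.linalg.norm(dense_mlp(x))
print(f"relative error with {k}/{d_ff} neurons: {err:.2f}")
```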

Speculative Decoding

  • Online Speculative Decoding (ICML 2024) [arXiv]

    • UC Berkeley & UCSD & Sisu Data & SJTU

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]

    • CMU

  • Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]

    • UC Berkeley & ICSI & LBNL

  • Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]

    • Google Research
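
A minimal sketch of the draft-then-verify loop with the standard accept/resample rule, assuming toy categorical "models"; a real implementation scores all drafted positions with the target model in a single batched forward pass, and samples one bonus token from the target when every draft is accepted.

```python
# Minimal speculative decoding sketch: a draft model proposes gamma tokens,
# the target model verifies them with accept probability min(1, p/q).
# The two "models" are toy categorical distributions (assumptions).
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8

def draft_dist(context):   # cheap, slightly different distribution (assumed)
    p = np.ones(VOCAB); p[context[-1] % VOCAB] += 2.0
    return p / p.sum()

def target_dist(context):  # "expensive" reference distribution (assumed)
    p = np.ones(VOCAB); p[(context[-1] + 1) % VOCAB] += 3.0
    return p / p.sum()

def speculative_step(context, gamma=4):
    # 1) draft model proposes gamma tokens autoregressively
    drafted, ctx = [], list(context)
    for _ in range(gamma):
        q = draft_dist(ctx)
        tok = rng.choice(VOCAB, p=q)
        drafted.append((tok, q))
        ctx.append(tok)
    # 2) accept each drafted token with probability min(1, p_target / p_draft);
    #    on the first rejection, resample from the residual distribution and stop
    accepted, ctx = [], list(context)
    for tok, q in drafted:
        p = target_dist(ctx)
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok); ctx.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)
            residual = residual / residual.sum() if residual.sum() > 0 else p
            fix = rng.choice(VOCAB, p=residual)
            accepted.append(fix); ctx.append(fix)
            break
    return accepted

print(speculative_step([3], gamma=4))
```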

Offloading

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]

    • Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU

    • High-throughput serving; only use a single GPU.
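
A toy sketch of weight offloading, assuming numpy arrays as stand-ins for CPU/GPU tensors: weights stay in host memory and are streamed in one layer at a time, with prefetching/overlap and KV-cache offloading omitted.

```python
# Toy sketch of layer-wise weight offloading: keep all layer weights in "CPU
# memory" and stream one layer at a time into a small "GPU buffer".
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, D = 12, 64

# weights resident in CPU (or disk) memory
cpu_weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(NUM_LAYERS)]

def load_to_gpu(w_cpu):
    # placeholder for a host-to-device copy (e.g., moving a tensor to the GPU)
    return w_cpu.copy()

def forward(x):
    for layer in range(NUM_LAYERS):
        w_gpu = load_to_gpu(cpu_weights[layer])  # only one layer resident at a time
        x = np.tanh(x @ w_gpu)
        del w_gpu                                # free the buffer before the next layer
    return x

print(forward(rng.standard_normal(D)).shape)
```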

Heterogeneous Environment

  • Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs (arXiv:2502.00722) [arXiv]

    • Cambridge & HKUST & PKU & ETH & Purdue

  • HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment (ICLR 2025) [Paper] [arXiv]

    • HKUST

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]

    • HKUST & ETH & CMU

    • Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline-parallel stage can be assigned a different number of layers and a different tensor model parallel degree)

    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout
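
A toy (1+1) evolutionary search over asymmetric layouts (layers per stage plus a per-stage tensor-parallel degree), assuming an invented cost model and GPU speeds; it only illustrates the search structure, not HexGen's actual algorithm.

```python
# Toy evolutionary search over asymmetric pipeline layouts. The cost model,
# mutation rule, and per-stage GPU speeds are invented placeholders.
import random

random.seed(0)
NUM_LAYERS, NUM_STAGES = 32, 4
GPU_SPEED = [1.0, 1.0, 0.5, 0.5]  # heterogeneous per-stage throughput (assumed)

def random_layout():
    cuts = sorted(random.sample(range(1, NUM_LAYERS), NUM_STAGES - 1))
    layers = [b - a for a, b in zip([0] + cuts, cuts + [NUM_LAYERS])]
    tp = [random.choice([1, 2, 4]) for _ in range(NUM_STAGES)]
    return layers, tp

def cost(layout):
    layers, tp = layout
    # pipeline latency ~ slowest stage; TP speeds a stage up but adds overhead
    return max(l / (GPU_SPEED[s] * tp[s] * 0.9 ** (tp[s] - 1))
               for s, l in enumerate(layers))

def mutate(layout):
    layers, tp = [list(x) for x in layout]
    i, j = random.sample(range(NUM_STAGES), 2)
    if layers[i] > 1:                    # move one layer between two stages
        layers[i] -= 1; layers[j] += 1
    k = random.randrange(NUM_STAGES)
    tp[k] = random.choice([1, 2, 4])     # re-roll one stage's TP degree
    return layers, tp

best = min((random_layout() for _ in range(20)), key=cost)
for _ in range(200):                     # simple (1+1) evolutionary loop
    cand = mutate(best)
    if cost(cand) < cost(best):
        best = cand
print(best, round(cost(best), 2))
```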

Fairness

  • Locality-aware Fair Scheduling in LLM Serving (arXiv:2501.14312) [arXiv]

    • UC Berkeley

  • Fairness in Serving Large Language Models (OSDI 2024) [Paper] [Code]

    • UC Berkeley

LLM Alignment

  • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (ATC 2024) [Paper]

    • THU

Acronyms

  • LLM: Large Language Model

  • LoRA: Low-Rank Adaptation
