Large Language Model (LLM)
Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism () [] []
Kuaishou
Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning () [] [] []
UC Berkeley & AWS & Google & SJTU & CMU & Duke
Generalize the search over inter- and intra-operator parallelism strategies.
Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates () [] [] []
UMich SymbioticLab & AWS & PKU
Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints () []
Rice & AWS
Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs () [] []
UCLA & CMU & MSR & Princeton
Resilient distributed training
UChicago & Microsoft & Stanford
Apple
CMU & PKU & CUHK
PKU
Skip-join multi-level feedback queue scheduling instead of first-come-first-served.
Proactive KV cache swapping.
Compared to Orca
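A minimal Python sketch of skip-join MLFQ scheduling in the spirit of the notes above; the quantum values, the `Request` fields, and the one-level demotion rule are illustrative assumptions, not FastServe's exact design.

```python
# Sketch of skip-join multi-level feedback queue (MLFQ) scheduling.
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    rid: int
    prefill_time: float          # estimated time to produce the first output token

class SkipJoinMLFQ:
    def __init__(self, quanta=(0.05, 0.1, 0.2, 0.4)):
        self.quanta = quanta                      # per-level time quantum (assumed)
        self.queues = [deque() for _ in quanta]   # index 0 = highest priority

    def admit(self, req: Request):
        # "Skip-join": enter the first level whose quantum covers the
        # estimated prefill time, instead of always joining level 0.
        level = next((i for i, q in enumerate(self.quanta)
                      if req.prefill_time <= q), len(self.quanta) - 1)
        self.queues[level].append(req)

    def schedule(self):
        # Run the head of the highest-priority non-empty queue.
        for level, q in enumerate(self.queues):
            if q:
                return level, q.popleft()
        return None, None

    def demote(self, level: int, req: Request):
        # A request that exhausts its quantum drops one level; preemption
        # happens per output token in an iteration-level scheduler.
        self.queues[min(level + 1, len(self.queues) - 1)].append(req)
```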
UC Berkeley & PKU & UPenn & Stanford & Google
Trade off the overhead of model parallelism against the serving latency reduction from statistical multiplexing.
Outstanding Paper Award
Model partitioning; PaLM; TPUv4
Microsoft DeepSpeed
Leverage CPU/NVMe/GPU memory.
CUHK
An orchestration framework for LLM-based applications: utilize task primitives as the basic units; represent each query’s workflow as a primitive-level dataflow graph.
Enable a larger design space for optimization, including graph optimization (i.e., parallelization and pipelining) and application-aware scheduling.
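A small sketch of the primitive-level dataflow idea, assuming generic primitive names (`embed_query`, `retrieve`, `prefill`, `decode`) and a simple topological scheduler; the actual primitives and runtime are richer.

```python
# Sketch: a query's workflow as a dataflow graph of task primitives,
# where primitives with all inputs ready can run in parallel.
from collections import defaultdict, deque

class DataflowGraph:
    def __init__(self):
        self.deps = defaultdict(set)     # node -> prerequisite nodes
        self.out = defaultdict(set)      # node -> downstream nodes

    def add_edge(self, src, dst):
        self.deps[dst].add(src)
        self.out[src].add(dst)
        self.deps.setdefault(src, set())

    def schedule(self):
        """Yield batches of primitives whose inputs are all ready."""
        indeg = {n: len(d) for n, d in self.deps.items()}
        ready = deque(n for n, d in indeg.items() if d == 0)
        while ready:
            batch = list(ready); ready.clear()
            yield batch                       # these can run in parallel
            for n in batch:
                for m in self.out[n]:
                    indeg[m] -= 1
                    if indeg[m] == 0:
                        ready.append(m)

g = DataflowGraph()
for src, dst in [("embed_query", "retrieve"), ("retrieve", "prefill"),
                 ("embed_query", "prefill"), ("prefill", "decode")]:
    g.add_edge(src, dst)
print(list(g.schedule()))  # [['embed_query'], ['retrieve'], ['prefill'], ['decode']]
```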
SJTU & MSRA
UC Berkeley & Stanford
Co-design the front-end programming interface and back-end serving runtime
SGLang; SGVM w/ RadixAttention
Reuse KV cache across multiple calls and programs
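A simplified sketch of prefix-based KV cache reuse in this spirit, using a plain trie over token ids in place of a compressed radix tree and string handles in place of real KV tensors.

```python
# Sketch: look up the longest cached token prefix and reuse its KV cache,
# so only the remaining suffix needs to be recomputed.
class TrieNode:
    def __init__(self):
        self.children = {}      # token id -> TrieNode
        self.kv_block = None    # handle to the KV cache entry for this token

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return KV handles for the longest cached prefix of `tokens`."""
        node, blocks = self.root, []
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            blocks.append(node.kv_block)
        return blocks

    def insert(self, tokens, kv_blocks):
        node = self.root
        for t, blk in zip(tokens, kv_blocks):
            node = node.children.setdefault(t, TrieNode())
            node.kv_block = blk

# Usage: two programs sharing a system prompt reuse its KV cache.
cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["b0", "b1", "b2", "b3"])
print(cache.match_prefix([1, 2, 3, 9]))   # ['b0', 'b1', 'b2'] -> 3 tokens reused
```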
Jeonbuk National University & Seoul National University
Leverage query-independent, offline caching to reuse a context KV cache store.
Cache Re-Positioning: shift keys to different positions in the encoding space.
Layer-Adaptive Cache Pruning: discard low-relevance caches for documents during pre-filling.
Adaptive Positional Allocation: adjust cache positions to maximize the use of the available positional encoding range.
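A rough sketch of what cache re-positioning can look like under a RoPE assumption: cached keys rotated for their original positions are re-rotated by the position delta so they can be reused at new slots in the encoding range. The base (10000) and interleaved pair layout follow the common RoPE convention and are assumptions, not necessarily CacheFocus's implementation.

```python
# Sketch: shift RoPE-encoded cached keys to new positions by re-rotation.
import numpy as np

def rope_angles(dim, positions, base=10000.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)    # (dim/2,)
    return np.outer(positions, inv_freq)                # (n, dim/2)

def rotate(keys, angles):
    # keys: (n, dim); each (even, odd) pair is rotated by the same angle.
    k1, k2 = keys[:, 0::2], keys[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(keys)
    out[:, 0::2] = k1 * cos - k2 * sin
    out[:, 1::2] = k1 * sin + k2 * cos
    return out

def reposition(cached_keys, old_pos, new_pos):
    """Move cached keys from old_pos to new_pos (rotations compose additively)."""
    delta = np.asarray(new_pos) - np.asarray(old_pos)
    return rotate(cached_keys, rope_angles(cached_keys.shape[1], delta))
```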
Adobe Research & IIT Bombay & IIT Kanpur
Identify the reusability of chunk-caches; perform a small fraction of recomputation to fix the cache and maintain output quality; store and evict chunk-caches.
A wrapper around vLLM; built on the xFormers backend optimized with Triton.
PKU & ByteDance
Organize the intermediate states of retrieved knowledge in a knowledge tree; cache them in the GPU and host memory.
Replacement policy: evaluate each node based on its access frequency, size, and access cost.
Priority = Clock + (Frequency × Cost) / Size
Nodes with lower priority are evicted first.
Built on vLLM.
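A minimal sketch of the priority-based replacement policy above (a Greedy-Dual-Size-Frequency-style scheme); the node fields and the clock update rule are illustrative assumptions.

```python
# Sketch: evict the knowledge-tree node with the lowest
# Clock + Frequency * Cost / Size priority.
from dataclasses import dataclass

@dataclass
class CacheNode:
    key: str
    size: float        # bytes of cached KV tensors
    cost: float        # cost to recompute/refetch the knowledge chunk
    frequency: int = 1

class PriorityCache:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0.0
        self.clock = 0.0               # rises to the priority of evicted nodes
        self.nodes = {}

    def priority(self, n: CacheNode) -> float:
        return self.clock + n.frequency * n.cost / n.size

    def access(self, key, size, cost):
        if key in self.nodes:
            self.nodes[key].frequency += 1
            return
        node = CacheNode(key, size, cost)
        while self.used + node.size > self.capacity and self.nodes:
            self.evict_one()
        self.nodes[key] = node
        self.used += node.size

    def evict_one(self):
        victim = min(self.nodes.values(), key=self.priority)
        self.clock = self.priority(victim)     # aging: future priorities rise
        self.used -= victim.size
        del self.nodes[victim.key]
```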
Alibaba
Seoul National University & FriendliAI
Iteration-level scheduling; selective batching.
UC Berkeley & Stanford & UCSD
vLLM, PagedAttention
Partition the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens
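A minimal sketch of block-based KV cache bookkeeping in this spirit: a per-sequence block table maps logical block indices to physical blocks holding a fixed number of tokens. The block size and the allocator are illustrative assumptions.

```python
# Sketch: per-sequence block table over fixed-size KV cache blocks.
BLOCK_SIZE = 16   # tokens per KV block (assumed)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the last one is full,
        # so internal fragmentation is bounded by one block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def slot(self, token_idx):
        """Physical (block id, offset) where this token's KV entry lives."""
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE
```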
Moonshot AI & Tsinghua
Best Paper Award
Separate the prefill and decoding clusters; prediction-based early rejection.
Distributed multi-layer KVCache pool; prefix-hashed KVCache object storage.
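A small sketch of prefix-hashed KV cache object keys for such a pool: each fixed-size token chunk is addressed by the hash of the entire prefix up to and including it, so any node can look up reusable prefixes by key. The chunk size and the local dict standing in for the distributed multi-layer store are assumptions.

```python
# Sketch: chained prefix hashes as storage keys for KV cache objects.
import hashlib

CHUNK = 512   # tokens per KV cache object (assumed)
store = {}    # stands in for the distributed multi-layer KVCache pool

def prefix_keys(token_ids):
    """One storage key per full chunk, chained over the whole prefix."""
    keys, h = [], hashlib.sha256()
    for start in range(0, len(token_ids) - len(token_ids) % CHUNK, CHUNK):
        h.update(str(token_ids[start:start + CHUNK]).encode("utf-8"))
        keys.append(h.hexdigest())
    return keys

def reusable_prefix_tokens(token_ids):
    """Tokens covered by cached chunks; only the rest needs prefill."""
    reusable = 0
    for k in prefix_keys(token_ids):
        if k not in store:
            break
        reusable += CHUNK
    return reusable
```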
ICT, CAS & Huawei Cloud
PKU & UCSD
UW & Microsoft
Best Paper Award
Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines.
MSR India & GaTech
Sarathi-Serve
CUHK-SZ & UVA & HKUST & Alibaba & Nokia Bell Labs
Edinburgh
PKU & Shanghai AI Lab
HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud
UC Berkeley
UW & Duke
PKU & NJU & Huawei Cloud
Key insight: the initial tokens of each chunk separately absorb a disproportionate amount of attention, preventing subsequent tokens from attending to relevant parts.
Propose an algorithm named LegoLink that recomputes k (≤ 32) initial tokens on each chunk (except the first), so that these tokens recognize their non-initial status and lose their attention-absorbing ability.
Compared to CacheBlend, LegoLink reduces recomputation complexity and relies on static attention sparsity.
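A minimal sketch of selecting which positions get recomputed under this idea; the chunk lengths and k are illustrative.

```python
# Sketch: recompute only the first k tokens of every chunk after the first.
def recompute_positions(chunk_lengths, k=16):
    """Return global token indices whose KV entries are recomputed."""
    positions, offset = [], 0
    for i, length in enumerate(chunk_lengths):
        if i > 0:                                   # skip the first chunk
            positions.extend(range(offset, offset + min(k, length)))
        offset += length
    return positions

# Example: three retrieved chunks of 100, 80, and 120 tokens.
print(len(recompute_positions([100, 80, 120])))   # 32 tokens recomputed
```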
UChicago
For an LLM input including multiple text chunks, reuse all KV caches but re-compute a small fraction of KV.
Objective: have both the speed of full KV reuse and the generation quality of full KV recompute.
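A rough sketch of the token-selection step, assuming deviations are measured at a probe layer between reused and freshly computed value vectors; the L2 metric and the recompute ratio are illustrative assumptions.

```python
# Sketch: pick the small fraction of tokens whose reused KV deviates most,
# and recompute only those in the following layers.
import numpy as np

def tokens_to_recompute(reused_v, fresh_v, ratio=0.15):
    """
    reused_v, fresh_v: (num_tokens, hidden) value vectors at a probe layer,
    from the reused chunk caches and from a fresh pass respectively.
    Returns indices of the top `ratio` fraction of tokens by deviation.
    """
    deviation = np.linalg.norm(reused_v - fresh_v, axis=-1)
    k = max(1, int(ratio * len(deviation)))
    return np.argsort(deviation)[-k:]
```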
Seoul National University
SJTU
A GPU-CPU hybrid inference engine
Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU
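A toy sketch of the hot/cold neuron split, using NumPy in place of real GPU/CPU kernels; the profiling signal, GPU budget, and single up-projection are illustrative assumptions.

```python
# Sketch: route frequently activated ("hot") FFN neurons to the GPU-resident
# weights and the long tail of "cold" neurons to the CPU, then merge results.
import numpy as np

def split_neurons(activation_freq, gpu_budget):
    """Indices of hot neurons (GPU-resident) and cold neurons (CPU)."""
    order = np.argsort(activation_freq)[::-1]
    return order[:gpu_budget], order[gpu_budget:]

def hybrid_ffn(x, W_up, hot, cold, act=lambda v: np.maximum(v, 0)):
    # In the real engine the two partial matmuls run on different devices;
    # here both are plain NumPy for clarity.
    y = np.zeros(W_up.shape[1])
    y[hot] = act(x @ W_up[:, hot])      # "GPU" part: hot neurons
    y[cold] = act(x @ W_up[:, cold])    # "CPU" part: cold neurons
    return y
```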
Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU
A system to predict contextual sparsity (small, input-dependent sets that yield approximately the same output).
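A toy sketch of an input-dependent sparsity predictor: a stand-in linear probe scores neurons for the current hidden state and keeps a top-k subset. The random probe and fixed budget are assumptions; the real predictors are trained per layer.

```python
# Sketch: predict which neurons matter for this particular input.
import numpy as np

rng = np.random.default_rng(0)
hidden, num_neurons, keep = 64, 256, 32

W_pred = rng.standard_normal((hidden, num_neurons)) * 0.01  # stand-in predictor

def predict_active_neurons(h):
    """Return indices of neurons predicted to matter for this token."""
    scores = h @ W_pred
    return np.argsort(scores)[-keep:]        # input-dependent sparse set

h = rng.standard_normal(hidden)
print(predict_active_neurons(h)[:5])         # different inputs -> different sets
```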
UC Berkeley & UCSD & Sisu Data & SJTU
CMU
UC Berkeley & ICSI & LBNL
Google Research
Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU
High-throughput serving; uses only a single GPU.
Cambridge & HKUST & PKU & ETH & Purdue
HKUST
HKUST & ETH & CMU
Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline parallel stage can be assigned a different number of layers and a different tensor model parallel degree).
Propose a heuristic-based evolutionary algorithm to search for the optimal layout
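A toy sketch of a heuristic evolutionary search over asymmetric layouts: a candidate assigns each pipeline stage a layer count and a tensor-parallel degree, and the search mutates candidates and keeps the lowest-cost one. The cost model, mutation rule, and constants are illustrative assumptions.

```python
# Sketch: (1+1) evolutionary search over (layers per stage, TP degree) layouts.
import random

TOTAL_LAYERS, STAGES, TP_CHOICES = 32, 4, (1, 2, 4)

def random_layout():
    cuts = sorted(random.sample(range(1, TOTAL_LAYERS), STAGES - 1))
    layers = [b - a for a, b in zip([0] + cuts, cuts + [TOTAL_LAYERS])]
    return [(l, random.choice(TP_CHOICES)) for l in layers]

def cost(layout):
    # Toy model: latency bounded by the slowest stage; tensor parallelism
    # speeds a stage up but adds a communication penalty.
    return max(l / tp + 0.1 * (tp - 1) for l, tp in layout)

def mutate(layout):
    child = list(layout)
    i = random.randrange(STAGES)
    child[i] = (child[i][0], random.choice(TP_CHOICES))
    return child

best = min((random_layout() for _ in range(64)), key=cost)
for _ in range(200):
    cand = mutate(best)
    if cost(cand) < cost(best):
        best = cand
print(best, round(cost(best), 2))
```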
UC Berkeley
UC Berkeley
THU
LLM: Large Language Model
LoRA: Low-Rank Adaptation
CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving () [] [] []
Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization ()
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching ()
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514) []
SpotServe: Serving Generative Large Language Models on Preemptible Instances () [] [] []
Fast Distributed Inference Serving for Large Language Models (arXiv:2305.05920) []
AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving () [] []
Efficiently Scaling Transformer Inference () []
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale () [] [] []
Teola: Towards End-to-End Optimization of LLM-based Applications () []
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable () [] []
SGLang: Efficient Execution of Structured Language Model Programs () [] [] [] []
CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation (arXiv:2502.11101) []
Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (SIGMOD 2025) []
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (arXiv:2404.12457) []
Llumnix: Dynamic Scheduling for Large Language Model Serving () [] []
Orca: A Distributed Serving System for Transformer-Based Generative Models () [] []
Efficient Memory Management for Large Language Model Serving with PagedAttention () [] [] [] []
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (FAST 2025) [] [] [] []
Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) []
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving () [] []
Splitwise: Efficient Generative LLM Inference Using Phase Splitting () [] [] []
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve () [] [] []
λScale: Enabling Fast Scaling for Serverless Large Language Model Inference (arXiv:2502.09922) []
ServerlessLLM: Low-Latency Serverless Inference for Large Language Models () [] [] []
dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving () []
CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) []
S-LoRA: Serving Thousands of Concurrent LoRA Adapters () [] []
Punica: Multi-Tenant LoRA Serving () [] []
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (ICML 2025) []
CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (arXiv:2405.16444) [] []
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management () []
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU () [] [] []
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time () [] []
Online Speculative Decoding () []
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification () [] []
Speculative Decoding with Big Little Decoder (NeurIPS 2023) []
Fast Inference from Transformers via Speculative Decoding () []
FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU () [] [] []
Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs (arXiv:2502.00722) []
HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment (ICLR 2025) [] []
HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment () [] [] []
Locality-aware Fair Scheduling in LLM Serving (arXiv:2501.14312) []
Fairness in Serving Large Language Models () [] []
PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch () []