Large Language Model (LLM)

I am actively maintaining this list.

LLM Training

Hybrid parallelism

  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (ATC 2024) [Paper] [Code]

    • Kuaishou

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]

    • UC Berkeley & AWS & Google & SJTU & CMU & Duke

    • Generalize the search through parallelism strategies.

Fault tolerance

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]

    • UMich SymbioticLab & AWS & PKU

  • Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]

    • Rice & AWS

  • Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]

    • UCLA & CMU & MSR & Princeton

    • Resilient distributed training.

LLM Inference

  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]

    • UChicago & Microsoft & Stanford

  • Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)

  • ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv 2312.11514) [arXiv]

    • Apple

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]

    • CMU & PKU & CUHK

  • Fast Distributed Inference Serving for Large Language Models (arXiv 2305.05920) [Paper]

    • PKU

    • Skip-join multi-level feedback queue scheduling instead of first-come-first-served (see the sketch after this list).

    • Proactive KV cache swapping.

    • Compared to Orca.

  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]

    • UC Berkeley & PKU & UPenn & Stanford & Google

    • Trade off the overhead of model parallelism against the serving latency reduction from statistical multiplexing.

  • Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]

    • Google

    • Outstanding Paper Award

    • Model partitioning; PaLM; TPUv4

  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]

    • Microsoft DeepSpeed

    • Leverage CPU/NVMe/GPU memory.
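
A minimal sketch of the skip-join multi-level feedback queue idea noted above for Fast Distributed Inference Serving (FastServe): new requests do not always enter the highest-priority queue; they "skip-join" the level whose quantum roughly covers their estimated prefill (first-token) time, and are demoted when their quantum is exhausted. Queue count, quanta, and the cost estimate are illustrative assumptions, not the paper's exact parameters.

```python
from collections import deque

# Illustrative per-level quanta (iterations a job may run before demotion).
QUANTA = [1, 2, 4, 8]

class Job:
    def __init__(self, job_id, est_first_token_time):
        self.job_id = job_id
        self.est_first_token_time = est_first_token_time  # estimated prefill cost
        self.budget = 0        # remaining quantum at the current level
        self.finished = False

class SkipJoinMLFQ:
    def __init__(self):
        self.queues = [deque() for _ in QUANTA]

    def admit(self, job):
        # Skip-join: enter the first level whose quantum covers the estimated
        # prefill time instead of always starting at the top level.
        level = 0
        while level < len(QUANTA) - 1 and QUANTA[level] < job.est_first_token_time:
            level += 1
        job.budget = QUANTA[level]
        self.queues[level].append((job, level))

    def step(self, run_one_iteration):
        # Run one generation iteration of the head job in the highest-priority
        # non-empty queue; demote it once its quantum is used up.
        for q in self.queues:
            if q:
                job, lvl = q.popleft()
                run_one_iteration(job)       # produce one token (prefill or decode)
                job.budget -= 1
                if job.finished:
                    return
                if job.budget > 0:
                    q.appendleft((job, lvl))
                else:
                    nxt = min(lvl + 1, len(QUANTA) - 1)
                    job.budget = QUANTA[nxt]
                    self.queues[nxt].append((job, nxt))
                return
```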

LLM-based Applications

  • Teola: Towards End-to-End Optimization of LLM-based Applications (ASPLOS 2025) [arXiv]

    • CUHK

    • An orchestration framework for LLM-based applications: use task primitives as the basic units; represent each query's workflow as a primitive-level dataflow graph.

    • Enable a larger design space for optimizations, including graph optimization (i.e., parallelization and pipelining) and application-aware scheduling.

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]

    • SJTU & MSRA

  • SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024) [Personal Notes] [Paper] [arXiv] [Code]

    • UC Berkeley & Stanford

    • Co-design the front-end programming interface and the back-end serving runtime.

    • SGLang; SGVM w/ RadixAttention

    • Reuse the KV cache across multiple calls and programs (see the sketch after this list).
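
A toy sketch of prefix-based KV-cache reuse in the spirit of RadixAttention: cached prefixes are kept in a trie keyed by token IDs, and a new request reuses the KV entries of its longest cached prefix, only prefilling the remaining suffix. The real SGLang runtime also manages GPU blocks and eviction; this only illustrates the lookup/insert logic, and the `kv_handle` field is a placeholder.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # token_id -> TrieNode
        self.kv_handle = None  # placeholder for this token's cached KV block

class PrefixCache:
    """Toy radix/trie prefix cache mapping token prefixes to cached KV handles."""

    def __init__(self):
        self.root = TrieNode()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched

    def insert(self, tokens, kv_handles):
        """Record KV handles for a fully processed token sequence."""
        node = self.root
        for t, kv in zip(tokens, kv_handles):
            node = node.children.setdefault(t, TrieNode())
            node.kv_handle = kv

cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
# A program call sharing the prompt prefix [1, 2, 3] reuses 3 cached tokens.
assert cache.match_prefix([1, 2, 3, 9]) == 3
```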

Retrieval-Augmented Generation (RAG)

  • CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation (arXiv:2502.11101) [arXiv]

    • Jeonbuk National University & Seoul National University

    • Leverage query-independent, offline caching to reuse a context KV cache store.

    • Cache Re-Positioning: shift keys to different positions in the encoding space.

    • Layer-Adaptive Cache Pruning: discard low-relevance caches for documents during pre-filling.

    • Adaptive Positional Allocation: adjust cache positions to maximize the use of the available positional encoding range.

  • Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (SIGMOD 2025) [arXiv]

    • Adobe Research & IIT Bombay & IIT Kanpur

    • Identify the reusability of chunk-caches; perform a small amount of recomputation to fix the caches and maintain output quality; store and evict chunk-caches.

    • A wrapper around vLLM; built on the xFormers backend optimized with Triton.

  • RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (arXiv:2404.12457) [arXiv]

    • PKU & ByteDance

    • Organize the intermediate states of retrieved knowledge in a knowledge tree; cache them in GPU and host memory.

    • Replacement policy: evaluate each node based on its access frequency, size, and access cost (see the sketch after this list).

      • Priority = Clock + (Frequency × Cost) / Size

      • Nodes with lower priority are evicted first.

    • Built on vLLM.
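
A small sketch of the replacement policy described above: each cached knowledge node gets a priority of Clock + (Frequency × Cost) / Size, the lowest-priority node is evicted first, and the clock is advanced to the evicted node's priority so long-idle entries age out (a Greedy-Dual-Size-Frequency-style policy). The field names and capacity accounting are illustrative.

```python
from dataclasses import dataclass

@dataclass
class CacheNode:
    key: str
    size: float        # e.g., number of cached KV tokens or bytes
    cost: float        # e.g., recomputation latency if the node is evicted
    frequency: int = 0
    priority: float = 0.0

class GDSFCache:
    """Priority-based eviction following the formula above."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0.0
        self.clock = 0.0
        self.nodes = {}

    def access(self, node: CacheNode):
        node.frequency += 1
        node.priority = self.clock + node.frequency * node.cost / node.size
        if node.key not in self.nodes:
            self.nodes[node.key] = node
            self.used += node.size
            while self.used > self.capacity:
                self._evict()

    def _evict(self):
        victim = min(self.nodes.values(), key=lambda n: n.priority)
        self.clock = victim.priority   # aging: future priorities start from here
        self.used -= victim.size
        del self.nodes[victim.key]
```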

Request Scheduling

  • Llumnix: Dynamic Scheduling for Large Language Model Serving (OSDI 2024) [Paper] [Code]

    • Alibaba

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]

    • Seoul National University & FriendliAI

    • Iteration-level scheduling; selective batching (see the sketch after this list).
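
A schematic loop for the iteration-level scheduling idea above: instead of batching whole requests, the scheduler re-forms the batch at every model iteration, so finished sequences leave immediately and waiting requests join mid-flight. `model_step` is a stand-in for one forward pass that generates one token per running request and returns the requests that finished.

```python
from collections import deque

def serve(requests, model_step, max_batch_size=8):
    """Iteration-level scheduling: rebuild the running batch every step."""
    waiting = deque(requests)   # requests not yet admitted
    running = []                # requests currently generating
    while waiting or running:
        # Admit new requests up to the batch limit at every iteration.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        # One iteration produces exactly one token for each running request.
        finished = model_step(running)
        # Finished requests exit immediately; they do not block the others.
        running = [r for r in running if r not in finished]
```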

KV Cache Management

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]

    • UC Berkeley & Stanford & UCSD

    • vLLM; PagedAttention

    • Partition the KV cache of each sequence into blocks, each block holding the keys and values for a fixed number of tokens (see the sketch after this list).
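
A minimal sketch of the block-based bookkeeping described above, in the spirit of PagedAttention: the KV cache is carved into fixed-size blocks, each sequence owns a block table mapping logical positions to physical blocks, and a new block is allocated only when the previous one fills up. The block size and free-list allocator are illustrative assumptions.

```python
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))

    def allocate(self):
        if not self.free_blocks:
            raise MemoryError("out of KV cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id):
        self.free_blocks.append(block_id)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new physical block only when crossing a block boundary,
        # so memory waste is bounded by one partially filled block per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, pos):
        """Physical (block id, offset) holding the KV entries of token `pos`."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=1024)
seq = Sequence(alloc)
for _ in range(40):              # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
assert len(seq.block_table) == 3
```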

Prefill-Decode (PD) Disaggregation

  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (FAST 2025) [Paper] [arXiv] [Slides] [Code]

    • Moonshot AI & Tsinghua

    • Best Paper Award

    • Separate the prefill and decoding clusters; prediction-based early rejection.

    • Distributed multi-layer KVCache pool; prefix-hashed KVCache object storage.

  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]

    • ICT, CAS & Huawei Cloud

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 2024) [Paper] [Code]

    • PKU & UCSD

  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 2024) [Paper] [arXiv] [Blog]

    • UW & Microsoft

    • Best Paper Award

    • Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines (see the sketch after this list).
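
A schematic of the prefill/decode split shared by the systems above: a prefill worker runs one pass over the prompt and hands the resulting KV cache to a decode worker, which then generates tokens iteratively. The `ToyModel` and the in-process "transfer" are placeholders; real systems ship the KV cache across GPUs or machines (e.g., over RDMA or a shared KV pool) and run the two phases on separately provisioned instances.

```python
class ToyModel:
    """Stand-in for an LLM engine; real systems run prefill and decode on different machines."""
    def prefill(self, prompt_tokens):
        return list(prompt_tokens)      # pretend the KV cache is just the token list
    def sample_next(self, kv_cache):
        return len(kv_cache)            # dummy "token"
    def decode_step(self, token, kv_cache):
        kv_cache.append(token)
        return token + 1, kv_cache

def run_request(prompt, prefill_model, decode_model, max_new_tokens=4):
    # Prefill phase (compute-bound): one pass over the whole prompt.
    kv_cache = prefill_model.prefill(prompt)
    token = prefill_model.sample_next(kv_cache)
    # KV transfer: placeholder for migrating the cache to the decode instance.
    kv_cache = list(kv_cache)
    # Decode phase (memory-bound): iterative single-token steps.
    output = [token]
    for _ in range(max_new_tokens - 1):
        token, kv_cache = decode_model.decode_step(token, kv_cache)
        output.append(token)
    return output

print(run_request([1, 2, 3], ToyModel(), ToyModel()))
```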

Chunked Prefill

  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (OSDI 2024) [Paper] [Code] [arXiv]

    • MSR India & GaTech

    • Sarathi-Serve (see the sketch after this list).
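
A toy scheduler for the chunked-prefill idea behind Sarathi-Serve: long prompts are split into bounded token chunks so that each iteration mixes ongoing decode tokens with at most one prefill chunk, keeping per-iteration latency predictable. The token budget, request fields, and one-chunk-per-iteration rule are illustrative simplifications.

```python
TOKEN_BUDGET = 512   # max tokens processed per iteration (illustrative)

class Request:
    def __init__(self, prompt_len):
        self.prompt_len = prompt_len
        self.prefilled = 0          # prompt tokens already prefilled
        self.decoding = False

    def in_prefill(self):
        return self.prefilled < self.prompt_len

def build_batch(requests):
    """Form one iteration's batch: all decode tokens first, then one prefill chunk."""
    batch, budget = [], TOKEN_BUDGET
    for r in requests:              # each decoding request costs one token
        if r.decoding and budget > 0:
            batch.append((r, 1))
            budget -= 1
    for r in requests:              # fill the remaining budget with one prefill chunk
        if r.in_prefill() and budget > 0:
            chunk = min(budget, r.prompt_len - r.prefilled)
            batch.append((r, chunk))
            r.prefilled += chunk
            if not r.in_prefill():
                r.decoding = True
            break
    return batch
```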

Serverless Inference

  • λScale: Enabling Fast Scaling for Serverless Large Language Model Inference (arXiv:2502.09922) [arXiv]

    • CUHK-SZ & UVA & HKUST & Alibaba & Nokia Bell Labs

  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (OSDI 2024) [Paper] [Code] [arXiv]

    • Edinburgh

LoRA Serving

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (OSDI 2024) [Paper]

    • PKU & Shanghai AI Lab

  • CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]

    • HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]

    • UC Berkeley

  • Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]

    • UW & Duke

Position-Independent Caching (PIC)

  • EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (ICML 2025) [arXiv]

    • PKU & NJU & Huawei Cloud

    • Key insight: the initial tokens of each separately encoded chunk absorb a disproportionate amount of attention, preventing subsequent tokens from attending to relevant parts.

    • Propose an algorithm named LegoLink to recompute k (≤ 32) initial tokens of each chunk (except the first), so that these tokens recognize their non-initial position and lose their attention-absorbing ability.

    • Compared to CacheBlend, LegoLink reduces recomputation complexity and relies on static attention sparsity.

  • CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (arXiv:2405.16444) [arXiv] [Code]

    • UChicago

    • For an LLM input including multiple text chunks, reuse all KV caches but re-compute a small fraction of the KV (see the sketch after this list).

    • Objective: achieve both the speed of full KV reuse and the generation quality of full KV recompute.
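
A sketch of the selective-recompute idea above: when concatenating precomputed chunk caches, reuse every cached KV entry but re-run a small set of token positions per chunk so that cross-chunk attention is partially restored. The selection rule shown here (the first k tokens of every non-leading chunk) follows the LegoLink description; CacheBlend instead picks the tokens with the largest expected KV deviation, and everything else in this sketch is illustrative.

```python
def plan_recompute(chunk_lengths, k=16):
    """Decide, per chunk, which token positions to recompute vs. reuse from cache.

    `chunk_lengths` lists the token count of each retrieved chunk. Returns a list
    of (reuse_positions, recompute_positions) per chunk.
    """
    plan = []
    for i, length in enumerate(chunk_lengths):
        if i == 0:
            recompute = set()                       # the leading chunk is reused as-is
        else:
            recompute = set(range(min(k, length)))  # re-run the chunk's initial tokens
        reuse = [p for p in range(length) if p not in recompute]
        plan.append((reuse, sorted(recompute)))
    return plan

# Example: three retrieved chunks of 100, 80, and 120 tokens.
for reuse, recompute in plan_recompute([100, 80, 120], k=16):
    print(len(reuse), "reused,", len(recompute), "recomputed")
```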

Sparsity

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]

    • Seoul National University

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (SOSP 2024) [Paper] [arXiv] [Code]

    • SJTU

    • A GPU-CPU hybrid inference engine.

    • Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU (see the sketch after this list).

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]

    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU

    • A system to predict contextual sparsity: small, input-dependent sets of attention heads and MLP neurons that yield approximately the same output.
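
A toy split of FFN neurons into a GPU-resident "hot" set and a CPU-resident "cold" set, mirroring the PowerInfer description above: neurons with high historical activation frequency are preloaded to the GPU, the rest stay on the CPU, and each side computes only its own rows before the results are merged. NumPy stands in for both devices purely for illustration; the hot fraction is an assumed parameter.

```python
import numpy as np

def split_neurons(activation_freq, hot_fraction=0.2):
    """Mark the most frequently activated neurons as 'hot' (GPU-resident)."""
    n_hot = max(1, int(len(activation_freq) * hot_fraction))
    hot_idx = np.argsort(activation_freq)[::-1][:n_hot]
    hot_mask = np.zeros(len(activation_freq), dtype=bool)
    hot_mask[hot_idx] = True
    return hot_mask                 # True -> GPU, False -> CPU

def hybrid_ffn(x, W, hot_mask):
    """Compute y = W @ x with hot rows on the 'GPU' and cold rows on the 'CPU'.

    Both halves use NumPy here; in a real engine the hot rows live in GPU memory
    while the cold rows are computed by the CPU and the partial results merged.
    """
    y = np.empty(W.shape[0])
    y[hot_mask] = W[hot_mask] @ x       # hot part: preloaded, fast path
    y[~hot_mask] = W[~hot_mask] @ x     # cold part: computed on demand
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
x = rng.standard_normal(4)
mask = split_neurons(rng.random(8))
assert np.allclose(hybrid_ffn(x, W, mask), W @ x)
```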

Speculative Decoding

  • Online Speculative Decoding (ICML 2024) [arXiv]

    • UC Berkeley & UCSD & Sisu Data & SJTU

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]

    • CMU

  • Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]

    • UC Berkeley & ICSI & LBNL

  • Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]

    • Google Research
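
The papers above share the draft-then-verify structure of speculative decoding: a cheap draft model proposes several tokens, the target model checks them in one pass, and tokens are accepted only while the two agree. The sketch below uses greedy acceptance for simplicity; the original algorithm uses a probabilistic acceptance rule that preserves the target distribution, and both models here are toy callables.

```python
def speculative_decode(prefix, draft_next, target_argmax, gamma=4, steps=16):
    """Greedy-acceptance sketch of draft-then-verify speculative decoding.

    draft_next(tokens)    -> next token proposed by the small draft model
    target_argmax(tokens) -> next token the large target model would pick
    gamma                 -> number of tokens drafted per verification pass
    """
    tokens = list(prefix)
    while len(tokens) < len(prefix) + steps:
        # 1) Draft gamma tokens autoregressively with the cheap model.
        draft = []
        for _ in range(gamma):
            draft.append(draft_next(tokens + draft))
        # 2) Verify: accept drafted tokens until the first disagreement.
        accepted = 0
        for i in range(gamma):
            if target_argmax(tokens + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        tokens += draft[:accepted]
        # 3) Always emit one token from the target model, guaranteeing progress.
        tokens.append(target_argmax(tokens))
    return tokens

# Toy models: the draft counts by 1; the target disagrees on multiples of 5.
draft = lambda t: t[-1] + 1
target = lambda t: t[-1] + (2 if (t[-1] + 1) % 5 == 0 else 1)
print(speculative_decode([0], draft, target))
```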

Offloading

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]

    • Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU

    • High-throughput serving using only a single GPU (see the sketch after this list).
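
A rough sketch of throughput-oriented offloading in the spirit of FlexGen: layer weights live in CPU (or disk) memory, are prefetched to the "GPU" one layer ahead of use, and are dropped right after, so only a couple of layers are resident at a time. Data movement is simulated with array copies; the layer count, lookahead of one, and eviction policy are illustrative assumptions (FlexGen additionally offloads KV cache and searches over placement policies).

```python
import numpy as np

class OffloadedModel:
    """Keep all layer weights off-GPU; stream them in one layer ahead of use."""

    def __init__(self, num_layers=12, dim=64, seed=0):
        rng = np.random.default_rng(seed)
        # "CPU/disk" copies of every layer's weights.
        self.cpu_weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                            for _ in range(num_layers)]
        self.gpu_cache = {}          # layer index -> "GPU-resident" weight

    def _prefetch(self, layer):
        if layer < len(self.cpu_weights) and layer not in self.gpu_cache:
            # Stand-in for an asynchronous host-to-device copy.
            self.gpu_cache[layer] = self.cpu_weights[layer].copy()

    def forward(self, x):
        self._prefetch(0)
        for i in range(len(self.cpu_weights)):
            self._prefetch(i + 1)                # overlap: fetch the next layer early
            x = np.tanh(self.gpu_cache[i] @ x)   # compute with the resident layer
            del self.gpu_cache[i]                # evict to bound GPU memory use
        return x

model = OffloadedModel()
print(model.forward(np.ones(64))[:4])
```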

Heterogeneous Environment

  • Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs (arXiv:2502.00722) [arXiv]

    • Cambridge & HKUST & PKU & ETH & Purdue

  • HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment (ICLR 2025) [Paper] [arXiv]

    • HKUST

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]

    • HKUST & ETH & CMU

    • Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline-parallel stage can be assigned a different number of layers and a different tensor model parallel degree); see the sketch after this list.

    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout.
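
A small data-structure sketch for the asymmetric layout described above: each pipeline stage carries its own layer count and tensor-parallel degree, and a candidate layout is checked for feasibility (all layers covered, each stage's weight shards fit the memory of the GPUs assigned to it). The memory model, per-layer size, and fields are illustrative; HexGen's actual search runs an evolutionary algorithm over layouts like these.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    num_layers: int     # layers assigned to this pipeline stage
    tp_degree: int      # tensor-parallel degree within the stage
    gpu_mem_gb: float   # memory of each GPU used by this stage

def is_feasible(stages, total_layers, gb_per_layer=1.2):
    """Check that an asymmetric (layers, TP degree) layout covers the model and fits memory."""
    if sum(s.num_layers for s in stages) != total_layers:
        return False
    for s in stages:
        # Each GPU in the stage holds 1/tp_degree of the stage's layer weights.
        per_gpu_gb = s.num_layers * gb_per_layer / s.tp_degree
        if per_gpu_gb > s.gpu_mem_gb:
            return False
    return True

# Heterogeneous example: large GPUs take more layers at a low TP degree,
# small GPUs take fewer layers at a higher TP degree.
layout = [Stage(num_layers=20, tp_degree=2, gpu_mem_gb=24),
          Stage(num_layers=12, tp_degree=4, gpu_mem_gb=8)]
print(is_feasible(layout, total_layers=32))
```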

Fairness

  • Locality-aware Fair Scheduling in LLM Serving (arXiv:2501.14312) [arXiv]

    • UC Berkeley

  • Fairness in Serving Large Language Models (OSDI 2024) [Paper] [Code]

    • UC Berkeley

LLM Alignment

  • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (ATC 2024) [Paper]

    • THU

Acronyms

  • LLM: Large Language Model

  • LoRA: Low-Rank Adaptation
