
Large Language Model (LLM)


I am actively maintaining this list.

LLM Training

Hybrid Parallelism

Fault Tolerance

LLM Inference

LLM-based Applications

  • Teola: Towards End-to-End Optimization of LLM-based Applications (ASPLOS 2025) [arXiv]

    • CUHK

    • An orchestration framework for LLM-based applications: utilize task primitives as the basic units; represent each query’s workflow as a primitive-level dataflow graph.

    • Enable a larger design space for optimization, including graph optimization (i.e., parallelization and pipelining) and application-aware scheduling (see the sketch below).
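
To make the primitive-level dataflow idea concrete, here is a minimal sketch (not Teola's actual API): hypothetical primitives such as embed_query and retrieve_docs form a dependency graph, and a toy level-order scheduler surfaces which primitives can be parallelized or pipelined.

```python
from collections import defaultdict, deque

# Toy primitive-level dataflow graph: each node is a task primitive
# (e.g., "embed_query", "retrieve_docs", "prefill_context"); edges are data
# dependencies between primitives of one query's workflow.
class PrimitiveGraph:
    def __init__(self):
        self.deps = defaultdict(set)      # node -> upstream nodes
        self.children = defaultdict(set)  # node -> downstream nodes

    def add_edge(self, upstream, downstream):
        self.deps[downstream].add(upstream)
        self.children[upstream].add(downstream)

    def schedule(self):
        """Yield batches of primitives whose dependencies are all satisfied.
        Primitives in the same batch have no mutual dependency, so an
        application-aware scheduler may parallelize or pipeline them."""
        nodes = set(self.deps) | set(self.children)
        indegree = {n: len(self.deps[n]) for n in nodes}
        ready = deque(n for n, d in indegree.items() if d == 0)
        while ready:
            batch = list(ready)
            ready.clear()
            yield batch
            for node in batch:
                for child in self.children[node]:
                    indegree[child] -= 1
                    if indegree[child] == 0:
                        ready.append(child)

# Example: a RAG-style query expressed as primitives.
g = PrimitiveGraph()
g.add_edge("embed_query", "retrieve_docs")
g.add_edge("retrieve_docs", "prefill_context")
g.add_edge("prefill_question", "decode_answer")  # question prefill overlaps with retrieval
g.add_edge("prefill_context", "decode_answer")
for batch in g.schedule():
    print("can run concurrently:", batch)
```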

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]

    • SJTU & MSRA

  • SGLang: Efficient Execution of Structured Language Model Programs (NeurIPS 2024) [Personal Notes] [Paper] [arXiv] [Code]

    • UC Berkeley & Stanford

    • Co-design the front-end programming interface and back-end serving runtime

    • Front end: the SGLang language; back end: SGVM with RadixAttention

    • Reuse the KV cache across multiple calls and programs (see the sketch below)
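
A minimal sketch of prefix-based KV reuse in the spirit of RadixAttention, using a one-token-per-edge trie instead of SGLang's actual radix tree; the class names and KV handles are illustrative.

```python
# Toy prefix cache: map token prefixes to (placeholder) KV-cache handles so that
# calls sharing a prefix (e.g., the same system prompt or few-shot examples)
# skip recomputing it. The real runtime uses a radix tree (token sequences on
# edges) with LRU eviction; this trie stores one token per edge for brevity.
class PrefixCacheNode:
    def __init__(self):
        self.children = {}     # token id -> PrefixCacheNode
        self.kv_handle = None  # handle to the KV block for the prefix ending here

class PrefixCache:
    def __init__(self):
        self.root = PrefixCacheNode()

    def match_prefix(self, tokens):
        """Return (num_cached_tokens, kv_handles) for the longest cached prefix."""
        node, handles = self.root, []
        for i, tok in enumerate(tokens):
            if tok not in node.children:
                return i, handles
            node = node.children[tok]
            handles.append(node.kv_handle)
        return len(tokens), handles

    def insert(self, tokens, kv_handles):
        """Record KV handles for every prefix of `tokens` after a prefill."""
        node = self.root
        for tok, handle in zip(tokens, kv_handles):
            node = node.children.setdefault(tok, PrefixCacheNode())
            node.kv_handle = handle

cache = PrefixCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
hit_len, handles = cache.match_prefix([1, 2, 3, 9])  # reuse 3 tokens, prefill only 1
print(hit_len, handles)                              # 3 ['kv1', 'kv2', 'kv3']
```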

Retrieval-Augmented Generation (RAG)

  • CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation (arXiv:2502.11101) [arXiv]

    • Jeonbuk National University & Seoul National University

    • Leverage query-independent, offline caching to reuse a context KV cache store.

    • Cache Re-Positioning: shift keys to different positions in the encoding space (sketched below).

    • Layer-Adaptive Cache Pruning: discard low-relevance caches for documents during pre-filling.

    • Adaptive Positional Allocation: adjust cache positions to maximize the use of the available positional encoding range.
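
A minimal sketch of the cache re-positioning idea, assuming RoPE-style keys: a key cached at one position can be moved to another by applying a rotation for the position delta. The function names, head dimension, and rotate-half pairing are illustrative, not CacheFocus's implementation.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Apply RoPE (rotate-half pairing) to x of shape (seq, dim) at `positions`."""
    seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = positions[:, None].float() * inv_freq[None, :]  # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def reposition_cached_keys(cached_keys, old_positions, new_positions):
    """Move already-rotated cached keys to new positions by rotating with the
    position delta, so a query-independent chunk cache can be placed anywhere
    in the final prompt layout without recomputing the keys."""
    return rope_rotate(cached_keys, new_positions - old_positions)

keys = torch.randn(4, 64)                     # 4 cached tokens, head dim 64
stored = rope_rotate(keys, torch.arange(4))   # keys as cached at positions 0..3
moved = reposition_cached_keys(stored, torch.arange(4), torch.arange(4) + 100)
# Equivalent (up to float error) to rotating the raw keys at positions 100..103.
print(torch.allclose(moved, rope_rotate(keys, torch.arange(100, 104)), atol=1e-4))
```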

  • Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation (SIGMOD 2025) [arXiv]

    • Adobe Research & IIT Bombay & IIT Kanpur

    • Identify the reusability of chunk-caches; recompute a small fraction of tokens to fix a reused cache and maintain output quality; store and evict chunk-caches.

    • A wrapper around vLLM; built on the xFormers backend, optimized with Triton.
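
A minimal sketch of chunk-cache bookkeeping under assumed names (not Cache-Craft's interfaces): chunk caches are keyed by content hash, reused on hit, evicted by reuse count, and each reused chunk gets a small recomputation budget to repair cross-chunk attention.

```python
import hashlib

class ChunkCacheStore:
    """Toy store of per-chunk KV caches with reuse-count-based eviction."""
    def __init__(self, capacity=1024):
        self.capacity = capacity   # maximum number of cached chunks
        self.entries = {}          # chunk id -> {"kv": ..., "hits": int}

    @staticmethod
    def chunk_id(chunk_text):
        return hashlib.sha256(chunk_text.encode()).hexdigest()

    def lookup(self, chunk_text):
        entry = self.entries.get(self.chunk_id(chunk_text))
        if entry is not None:
            entry["hits"] += 1
            return entry["kv"]
        return None

    def insert(self, chunk_text, kv):
        if len(self.entries) >= self.capacity:
            # Evict the least-reused chunk cache (a real policy also weighs size and cost).
            victim = min(self.entries, key=lambda cid: self.entries[cid]["hits"])
            del self.entries[victim]
        self.entries[self.chunk_id(chunk_text)] = {"kv": kv, "hits": 0}

def plan_prefill(retrieved_chunks, store, recompute_frac=0.1):
    """For each retrieved chunk, reuse its chunk-cache if present and recompute
    only a small token budget to repair cross-chunk attention; otherwise prefill
    the whole chunk and cache it for later requests."""
    plan = []
    for text, num_tokens in retrieved_chunks:
        if store.lookup(text) is None:
            plan.append((text, "full_prefill", num_tokens))
        else:
            plan.append((text, "reuse_plus_fix", max(1, int(recompute_frac * num_tokens))))
    return plan
```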

  • RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation (arXiv:2404.12457) [arXiv]

    • PKU & ByteDance

    • Organize the intermediate states of retrieved knowledge in a knowledge tree; cache them in the GPU and host memory.

    • Replacement policy: evaluate each node based on its access frequency, size, and access cost.

      • Priority = Clock + (Frequency × Cost) / Size

      • Nodes with lower priority are evicted first.

    • Built on vLLM.
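
A worked sketch of the replacement policy described above, using the priority Clock + (Frequency × Cost) / Size; the node fields and the greedy-dual-style clock update are illustrative simplifications, not RAGCache's implementation.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeNode:
    doc_id: str
    frequency: int   # access count of this node
    cost: float      # e.g., time to recompute its KV tensors (ms)
    size: float      # e.g., KV cache size (MB)

def priority(node, clock):
    # Priority = Clock + (Frequency × Cost) / Size; lower priority is evicted first.
    return clock + node.frequency * node.cost / node.size

def evict(nodes, clock, space_needed):
    """Evict the lowest-priority nodes until enough space is freed."""
    freed, victims = 0.0, []
    for node in sorted(nodes, key=lambda n: priority(n, clock)):
        if freed >= space_needed:
            break
        victims.append(node)
        freed += node.size
    if victims:
        # Greedy-dual-style aging: advance the clock to the last victim's priority
        # so long-idle nodes gradually fall below newly inserted ones.
        clock = priority(victims[-1], clock)
    return victims, clock

nodes = [
    KnowledgeNode("doc_a", frequency=5, cost=12.0, size=4.0),   # priority 15.0
    KnowledgeNode("doc_b", frequency=1, cost=8.0, size=16.0),   # priority 0.5
    KnowledgeNode("doc_c", frequency=3, cost=20.0, size=8.0),   # priority 7.5
]
victims, clock = evict(nodes, clock=0.0, space_needed=16.0)
print([v.doc_id for v in victims], clock)   # ['doc_b'] 0.5
```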

Request Scheduling

KV Cache Management

Prefill-Decode (PD) Disaggregation

Chunked Prefill

Serverless Inference

LoRA Serving

Position-Independent Caching (PIC)

  • EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models (ICML 2025) [arXiv]

    • PKU & NJU & Huawei Cloud

    • Key insight: the initial tokens of each independently cached chunk absorb a disproportionate amount of attention, preventing subsequent tokens from attending to the relevant parts of the context.

    • Propose an algorithm named LegoLink that recomputes the k (≤ 32) initial tokens of each chunk (except the first), so these tokens recognize their non-initial position and lose their attention-absorbing ability (sketched below).

    • Compared to CacheBlend, LegoLink reduces recomputation complexity and relies on static attention sparsity.
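
A minimal sketch of the LegoLink selection rule summarized above: when independently cached chunks are concatenated, the first k tokens of every chunk except the first are marked for recomputation. The chunk lengths and the choice of k are illustrative.

```python
def legolink_recompute_positions(chunk_lengths, k=16):
    """Given the token lengths of concatenated context chunks whose KV caches
    were computed independently, return the global token positions to recompute:
    the first min(k, length) tokens of every chunk except the first. Recomputing
    them lets those tokens see that they are not sequence-initial, which removes
    their attention-absorbing (attention-sink) behavior."""
    positions, offset = [], 0
    for i, length in enumerate(chunk_lengths):
        if i > 0:
            positions.extend(range(offset, offset + min(k, length)))
        offset += length
    return positions

# Three cached chunks of 100, 80, and 120 tokens: only 2 * 16 = 32 of the
# 300 context tokens are recomputed during prefill.
print(legolink_recompute_positions([100, 80, 120], k=16))
```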

  • CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion (arXiv:2405.16444) [arXiv] [Code]

    • UChicago

    • For an LLM input that includes multiple text chunks, reuse all KV caches but recompute a small fraction of the KV values (see the selection sketch below).

    • Objective: achieve both the speed of full KV reuse and the generation quality of full KV recomputation.
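
A minimal sketch of selective recomputation in the spirit of CacheBlend, assuming, as a stand-in for the paper's exact selection rule, that the tokens whose reused KV values deviate most from freshly computed values on a probe layer are the ones recomputed; shapes and the recompute ratio are illustrative.

```python
import torch

def select_tokens_to_recompute(reused_kv, fresh_kv, ratio=0.02):
    """Pick the tokens whose reused KV values deviate most from freshly computed
    values on a probe layer; only these tokens are recomputed on the remaining
    layers, while the rest keep their cached KV. Inputs: (num_tokens, dim)."""
    deviation = (reused_kv - fresh_kv).norm(dim=-1)   # per-token L2 deviation
    k = max(1, int(ratio * reused_kv.shape[0]))
    return torch.topk(deviation, k).indices.sort().values

num_tokens, dim = 512, 128
reused = torch.randn(num_tokens, dim)             # KV reused from chunk caches
fresh = reused.clone()
fresh[::37] += 0.5 * torch.randn(dim)             # a few tokens drift due to cross-chunk attention
print(select_tokens_to_recompute(reused, fresh))  # indices of (some of) the drifted tokens
```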

Sparsity

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]

    • Seoul National University

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (SOSP 2024) [Paper] [arXiv] [Code]

    • SJTU

    • A GPU-CPU hybrid inference engine

    • Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU (sketched below)
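
A minimal sketch of the hot/cold neuron split for a single ReLU FFN layer, assuming an offline activation-frequency profile; the threshold, device placement, and the absence of an online activation predictor are simplifications, not PowerInfer's engine.

```python
import torch

class HybridFFN(torch.nn.Module):
    """Toy ReLU FFN whose neurons are split by an offline activation-frequency
    profile: "hot" neurons live on the GPU, "cold" neurons stay on the CPU, and
    the two partial outputs sum exactly to the dense layer's output."""
    def __init__(self, weight_in, weight_out, hot_mask, device="cuda"):
        super().__init__()
        self.device = device if torch.cuda.is_available() else "cpu"
        self.w_in_hot = weight_in[hot_mask].to(self.device)       # (n_hot, d)
        self.w_out_hot = weight_out[:, hot_mask].to(self.device)  # (d, n_hot)
        self.w_in_cold = weight_in[~hot_mask]                     # stays on CPU
        self.w_out_cold = weight_out[:, ~hot_mask]

    def forward(self, x_cpu):
        # Hot path on the GPU, cold path on the CPU; the real engine additionally
        # skips cold neurons that an online predictor marks as inactive.
        x_gpu = x_cpu.to(self.device)
        hot = (x_gpu @ self.w_in_hot.T).relu() @ self.w_out_hot.T
        cold = (x_cpu @ self.w_in_cold.T).relu() @ self.w_out_cold.T
        return hot.cpu() + cold

d, n = 64, 256
w_in, w_out = torch.randn(n, d), torch.randn(d, n)
activation_freq = torch.rand(n)                 # offline profiling statistics
layer = HybridFFN(w_in, w_out, hot_mask=activation_freq > 0.7)
print(layer(torch.randn(2, d)).shape)           # torch.Size([2, 64])
```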

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]

    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU

    • A system that predicts contextual sparsity: small, input-dependent sets of attention heads and MLP neurons that yield approximately the same output as the dense model (sketched below).
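
A minimal sketch of a contextual-sparsity predictor for an MLP block: a small low-rank network looks at the current hidden state and guesses which neurons will activate, so only those rows are computed. The predictor architecture, top-k fraction, and untrained weights are illustrative, not Deja Vu's trained predictors.

```python
import torch

class SparsityPredictor(torch.nn.Module):
    """Small low-rank predictor that guesses which MLP neurons will be active
    for the current hidden state, i.e., its contextual sparsity."""
    def __init__(self, d_model, n_neurons, rank=32):
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(d_model, rank), torch.nn.ReLU(),
            torch.nn.Linear(rank, n_neurons),
        )

    def forward(self, x, top_frac=0.1):
        scores = self.proj(x)                          # (batch, n_neurons)
        k = max(1, int(top_frac * scores.shape[-1]))
        return torch.topk(scores, k, dim=-1).indices   # predicted active neurons

def sparse_mlp(x, w_in, w_out, active_idx):
    """Compute the MLP for one token using only the predicted-active neurons."""
    h = (x @ w_in[active_idx].T).relu()                # (k,)
    return h @ w_out[:, active_idx].T                  # (d_model,)

d, n = 64, 1024
w_in, w_out = torch.randn(n, d), torch.randn(d, n)
predictor = SparsityPredictor(d, n)                    # would be trained per layer
x = torch.randn(d)
active = predictor(x.unsqueeze(0))[0]                  # neuron indices for this token
print(sparse_mlp(x, w_in, w_out, active).shape)        # torch.Size([64])
```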

Speculative Decoding

Offloading

Heterogeneous Environment

  • Demystifying Cost-Efficiency in LLM Serving over Heterogeneous GPUs (arXiv:2502.00722) [arXiv]

    • Cambridge & HKUST & PKU & ETH & Purdue

  • HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment (ICLR 2025) [Paper] [arXiv]

    • HKUST

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]

    • HKUST & ETH & CMU

    • Support asymmetric tensor model parallelism and pipeline parallelism in heterogeneous settings (i.e., each pipeline-parallel stage can be assigned a different number of layers and a different tensor model parallel degree)

    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout (sketched below)
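
A minimal sketch of an evolutionary layout search, where a layout assigns each pipeline stage its own layer count and tensor-parallel degree; the cost model, mutation rule, and GPU-speed vector are placeholders rather than HexGen's actual scheduler and cost model.

```python
import random

def random_layout(num_layers, num_stages, tp_choices=(1, 2, 4)):
    """A layout assigns each pipeline stage a layer count and a TP degree; stages may differ."""
    cuts = sorted(random.sample(range(1, num_layers), num_stages - 1))
    sizes = [b - a for a, b in zip([0] + cuts, cuts + [num_layers])]
    return [(layers, random.choice(tp_choices)) for layers in sizes]

def cost(layout, gpu_speed):
    """Placeholder cost model: pipeline throughput is limited by the slowest stage,
    and a stage speeds up with its TP degree and the speed of its (heterogeneous) GPUs."""
    return max(layers / (tp * gpu_speed[i % len(gpu_speed)])
               for i, (layers, tp) in enumerate(layout))

def mutate(layout, tp_choices=(1, 2, 4)):
    layout = list(layout)
    i = random.randrange(len(layout))
    layout[i] = (layout[i][0], random.choice(tp_choices))  # re-pick one stage's TP degree
    return layout

def evolve(num_layers=32, num_stages=4, gpu_speed=(1.0, 1.0, 0.5, 0.5),
           population=32, generations=50):
    pop = [random_layout(num_layers, num_stages) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=lambda layout: cost(layout, gpu_speed))
        survivors = pop[: population // 2]
        pop = survivors + [mutate(random.choice(survivors)) for _ in survivors]
    return min(pop, key=lambda layout: cost(layout, gpu_speed))

print(evolve())   # best layout found: per-stage (layer count, TP degree) pairs
```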

Fairness

LLM Alignment

Acronyms

  • LLM: Large Language Model

  • LoRA: Low-Rank Adaptation
