Large Language Model (LLM)

I am actively maintaining this list.

LLM Training

Hybrid parallelism

  • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism (ATC 2024) [Paper] [Code]

    • Kuaishou

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]

    • UC Berkeley & AWS & Google & SJTU & CMU & Duke

    • Generalize the search over inter-operator and intra-operator parallelism strategies (see the sketch below).
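
A minimal sketch of the idea of hierarchically searching over parallelism strategies, under a toy setup: an outer enumeration partitions layers into pipeline stages, and an inner step picks an intra-operator degree per stage. The cost model, `LAYER_COST`, and `INTRA_OP_DEGREES` below are invented for illustration; this is not Alpa's actual dynamic program, ILP, or API.

```python
"""Toy two-level parallelism search: enumerate pipeline (inter-op) layer
partitions on the outside, pick an intra-op degree per stage on the inside.
Illustrative only; not Alpa's dynamic program, ILP, or cost model."""
from itertools import combinations

LAYER_COST = [4.0, 4.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0]  # hypothetical per-layer cost
NUM_STAGES = 2
INTRA_OP_DEGREES = [1, 2, 4]                            # candidate GPUs per stage

def stage_latency(layers, degree):
    """Hypothetical intra-op cost: compute shrinks with the degree, plus a
    fixed communication penalty per extra GPU."""
    return sum(layers) / degree + 0.3 * (degree - 1)

def best_intra_op(layers):
    return min((stage_latency(layers, d), d) for d in INTRA_OP_DEGREES)

def search(layer_costs, num_stages):
    """Minimize the bottleneck stage over all contiguous layer partitions."""
    n, best = len(layer_costs), (float("inf"), None)
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = [0, *cuts, n]
        plans = [best_intra_op(layer_costs[bounds[i]:bounds[i + 1]])
                 for i in range(num_stages)]
        bottleneck = max(lat for lat, _ in plans)
        if bottleneck < best[0]:
            best = (bottleneck, [(bounds[i], bounds[i + 1], plans[i][1])
                                 for i in range(num_stages)])
    return best

bottleneck, plan = search(LAYER_COST, NUM_STAGES)
for start, end, degree in plan:
    print(f"layers [{start}, {end}) -> intra-op degree {degree}")
print(f"pipeline bottleneck latency: {bottleneck:.2f}")
```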

Fault tolerance

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]

    • UMich SymbioticLab & AWS & PKU

  • Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]

    • Rice & AWS

  • Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]

    • UCLA & CMU & MSR & Princeton

    • Resilient distributed training

LLM Inference

  • CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving (SIGCOMM 2024) [arXiv] [Code] [Video]

    • UChicago & Microsoft & Stanford

  • Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve (OSDI 2024) [Paper] [Code] [arXiv]

    • MSR India & GaTech

    • Sarathi-Serve

  • ServerlessLLM: Low-Latency Serverless Inference for Large Language Models (OSDI 2024) [Paper] [Code] [arXiv]

    • Edinburgh

  • Parrot: Efficient Serving of LLM-based Applications with Semantic Variable (OSDI 2024) [Paper] [Code]

    • SJTU & MSRA

  • Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization (ISCA 2024)

  • ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching (ISCA 2024)

  • Efficiently Programming Large Language Models using SGLang (arXiv 2312.07104) [Personal Notes] [arXiv] [Code]

    • UC Berkeley & Stanford

    • Co-design the front-end programming interface and back-end serving runtime

    • SGLang; SGVM w/ RadixAttention

    • Reuse the KV cache across multiple calls and programs (prefix-reuse sketch below)
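
A minimal sketch of the prefix-reuse idea behind RadixAttention, assuming a plain prefix trie keyed by token IDs (SGLang uses a radix tree with eviction): shared prompt prefixes map to cached KV handles, so only the new suffix needs prefill. The `kv_handle` strings and token IDs are placeholders, not SGLang's data structures.

```python
"""Minimal prefix-trie KV reuse sketch: match the longest cached token prefix so
only the suffix needs prefill. Not SGLang's RadixAttention implementation."""

class TrieNode:
    def __init__(self):
        self.children = {}      # token id -> TrieNode
        self.kv_handle = None   # placeholder for a cached KV block reference

class PrefixKVCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, kv_handles):
        """Store one KV handle per token along the path."""
        node = self.root
        for tok, handle in zip(tokens, kv_handles):
            node = node.children.setdefault(tok, TrieNode())
            node.kv_handle = handle

    def longest_prefix(self, tokens):
        """Return (matched_len, handles) for the longest cached prefix."""
        node, handles = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            handles.append(node.kv_handle)
        return len(handles), handles

cache = PrefixKVCache()
system_prompt = [101, 7, 7, 9]                    # shared across programs/calls
cache.insert(system_prompt, [f"kv{i}" for i in range(len(system_prompt))])

request = system_prompt + [42, 43]                # same prefix, new suffix
matched, handles = cache.longest_prefix(request)
print(f"reuse {matched} cached tokens, prefill only {len(request) - matched}")
```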

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv 2312.12456) [arXiv]

    • SJTU

    • A GPU-CPU hybrid inference engine

    • Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU (see the sketch below)
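
A toy sketch of the hot/cold neuron split for a single ReLU FFN projection, with NumPy arrays standing in for GPU- and CPU-resident partitions. The activation statistics and the 0.7 threshold are invented; PowerInfer's predictors, kernels, and placement policy are not reproduced here.

```python
"""Hot/cold neuron split for one ReLU FFN projection: frequently activated
("hot") rows live on the fast device, the rest on the slow one, and the two
partial results are merged. Thresholds and statistics are made up."""
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_FF = 64, 256

W_up = rng.standard_normal((D_FF, D_IN))
activation_freq = rng.random(D_FF)         # stand-in for offline profiling stats
hot = activation_freq > 0.7                # ~30% of neurons treated as "hot"

W_hot, W_cold = W_up[hot], W_up[~hot]      # GPU-resident vs CPU-resident slices

def ffn_up(x):
    """Compute the hot and cold partitions separately, then scatter back."""
    y = np.empty(D_FF)
    y[hot] = np.maximum(W_hot @ x, 0.0)    # fast path (GPU in PowerInfer)
    y[~hot] = np.maximum(W_cold @ x, 0.0)  # slow path (CPU in PowerInfer)
    return y

x = rng.standard_normal(D_IN)
dense = np.maximum(W_up @ x, 0.0)
print("split result matches dense:", np.allclose(ffn_up(x), dense))
```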

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv 2312.11514) [arXiv]

    • Apple

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]

    • CMU & PKU & CUHK

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]

    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU

    • A system that predicts contextual sparsity: small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output (see the sketch below).
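
A toy sketch of contextual sparsity for one ReLU MLP block. The "predictor" here is an oracle that peeks at the true pre-activations; Deja Vu instead trains cheap lookahead predictors so the full `W1 @ x` never has to be computed. Dimensions are arbitrary.

```python
"""Contextual sparsity for one ReLU MLP block: predict which neurons will fire
for this input and compute only those rows/columns. The "predictor" below is an
oracle peeking at the true pre-activations; Deja Vu trains cheap lookahead
predictors instead."""
import numpy as np

rng = np.random.default_rng(0)
D, D_FF = 64, 512

W1 = rng.standard_normal((D_FF, D)) / np.sqrt(D)
W2 = rng.standard_normal((D, D_FF)) / np.sqrt(D_FF)

def mlp_dense(x):
    return W2 @ np.maximum(W1 @ x, 0.0)

def predict_active_neurons(x):
    # Oracle stand-in for a learned predictor: neurons with positive pre-activation.
    return np.flatnonzero(W1 @ x > 0.0)

def mlp_sparse(x):
    idx = predict_active_neurons(x)        # small, input-dependent neuron subset
    h = np.maximum(W1[idx] @ x, 0.0)       # touch only the predicted rows of W1
    return W2[:, idx] @ h                  # and the matching columns of W2

x = rng.standard_normal(D)
idx = predict_active_neurons(x)
print(f"{len(idx)}/{D_FF} neurons used;",
      "outputs match:", np.allclose(mlp_dense(x), mlp_sparse(x)))
```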

  • Fast Distributed Inference Serving for Large Language Models (arXiv 2305.05920) [Paper]

    • PKU

    • Skip-join multi-level feedback queue scheduling instead of first-come-first-served (see the sketch below).

    • Proactive KV cache swapping.

    • Compared to Orca
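
A minimal sketch of skip-join MLFQ scheduling under invented quanta: a request enters the queue level whose quantum covers its estimated prefill time (the "skip-join") and is demoted when it exhausts a level's quantum, so short requests are not stuck behind long prefills. This is not FastServe's scheduler and omits its proactive KV swapping.

```python
"""Skip-join MLFQ sketch: a request enters the level whose quantum covers its
estimated prefill time, runs one iteration at a time, and is demoted after
exhausting the level's quantum. Quanta and timings are invented."""
from collections import deque
from dataclasses import dataclass, field

QUANTA = [1.0, 2.0, 4.0, 8.0]                  # hypothetical per-level quanta

@dataclass
class Request:
    name: str
    iter_times: list                           # [prefill, decode, decode, ...]
    done: list = field(default_factory=list)
    used: float = 0.0                          # time spent at the current level

def skip_join_level(req):
    """Skip-join: start at the shallowest level whose quantum covers the prefill."""
    prefill = req.iter_times[0]
    return next((i for i, q in enumerate(QUANTA) if prefill <= q), len(QUANTA) - 1)

def schedule(requests):
    queues = [deque() for _ in QUANTA]
    for r in requests:
        queues[skip_join_level(r)].append(r)
    clock = 0.0
    while any(queues):
        lvl = next(i for i, q in enumerate(queues) if q)     # highest priority first
        req = queues[lvl][0]
        step = req.iter_times[len(req.done)]                 # run ONE iteration
        clock += step
        req.used += step
        req.done.append(step)
        if len(req.done) == len(req.iter_times):             # request finished
            queues[lvl].popleft()
            print(f"{req.name} finished at t={clock:.1f}")
        elif req.used > QUANTA[lvl] and lvl + 1 < len(QUANTA):
            queues[lvl].popleft()                            # demote to the next level
            req.used = 0.0
            queues[lvl + 1].append(req)

schedule([
    Request("short-prompt", [0.8] + [0.1] * 3),
    Request("long-prompt", [6.0] + [0.1] * 3),   # skip-joins to a deep queue level
])
```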

  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]

    • UC Berkeley & PKU & UPenn & Stanford & Google

    • Trade off the overhead of model parallelism against the latency reduction from statistical multiplexing (toy illustration below).
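
A toy illustration of the statistical-multiplexing argument with invented numbers: a burst for one model either queues on its dedicated GPU while the other GPU idles, or is spread across both GPUs via model parallelism at a small per-request overhead. This is not AlpaServe's placement algorithm or cost model.

```python
"""Toy statistical-multiplexing comparison: a burst of requests for model A is
served either by A's dedicated GPU (model B's GPU idles) or by both GPUs with
model parallelism at a small overhead. Numbers are invented."""

SINGLE_GPU_TIME = 1.00    # per-request latency on one dedicated GPU
PARALLEL_TIME = 0.55      # per-request latency across 2 GPUs (incl. ~10% overhead)
BURST = 4                 # burst of requests for model A; model B receives none

def mean_latency(per_request_time, n):
    # Requests are served back to back; latency includes queueing delay.
    return sum(per_request_time * (i + 1) for i in range(n)) / n

dedicated = mean_latency(SINGLE_GPU_TIME, BURST)    # GPU-B sits idle
multiplexed = mean_latency(PARALLEL_TIME, BURST)    # the burst borrows GPU-B

print(f"dedicated placement:           mean latency {dedicated:.2f}s")
print(f"model-parallel, 2 shared GPUs: mean latency {multiplexed:.2f}s")
```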

  • Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]

    • Google

    • Outstanding Paper Award

    • Model partitioning; PaLM; TPUv4

  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]

    • Microsoft DeepSpeed

    • Leverage CPU/NVMe/GPU memory.

Request Scheduling

  • Llumnix: Dynamic Scheduling for Large Language Model Serving (OSDI 2024) [Paper] [Code]

    • Alibaba

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]

    • Seoul National University & FriendliAI

    • Iteration-level scheduling; selective batching (see the sketch below).
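
A minimal sketch of iteration-level (continuous) batching: the running batch is re-formed at every decoding iteration, so finished sequences leave immediately and waiting ones are admitted without draining the batch. This toy loop ignores selective batching and Orca's actual kernels and scheduler.

```python
"""Iteration-level (continuous) batching loop: the batch is re-formed at every
decoding iteration, so finished sequences leave immediately and waiting ones
join without draining the whole batch. Toy sketch; not Orca's scheduler."""
from collections import deque
from dataclasses import dataclass

MAX_BATCH = 2

@dataclass
class Seq:
    name: str
    remaining: int                 # decode iterations left for this sequence

def serve(pending):
    waiting, running, step = deque(pending), [], 0
    while waiting or running:
        # Admit new sequences at an *iteration* boundary, not a request boundary.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        step += 1
        for seq in running:
            seq.remaining -= 1     # one decoding iteration for the whole batch
        for seq in [s for s in running if s.remaining == 0]:
            print(f"iter {step}: {seq.name} finished and frees its slot")
        running = [s for s in running if s.remaining > 0]

serve([Seq("A", 2), Seq("B", 5), Seq("C", 1)])
# C is admitted as soon as A finishes, without waiting for B to complete.
```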

KV Cache Management

  • InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI 2024) [Paper]

    • Seoul National University

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]

    • UC Berkeley & Stanford & UCSD

    • vLLM, PagedAttention

    • Partition the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens (see the sketch below)
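
A minimal sketch of paged KV-cache bookkeeping, assuming a made-up `BLOCK_SIZE` and an in-process free list: each sequence's block table maps logical token positions to physical blocks allocated on demand. This is not vLLM's implementation (no sharing, copy-on-write, or preemption).

```python
"""Paged KV-cache bookkeeping sketch: KV memory is carved into fixed-size
blocks, and each sequence keeps a block table mapping logical token positions
to physical blocks allocated on demand. Not vLLM's implementation."""

BLOCK_SIZE = 16                    # tokens per KV block (value chosen arbitrarily)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV blocks (would trigger preemption/swap)")
        return self.free.pop()

    def release(self, blocks):
        self.free.extend(blocks)

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []      # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        """Grab a new physical block only when the current one fills up."""
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def slot(self, pos):
        """Physical (block, offset) location of token `pos`'s keys/values."""
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=8)
seq = Sequence(alloc)
for _ in range(40):                # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token()
print("block table:", seq.block_table)
print("token 20 lives at (block, offset):", seq.slot(20))
```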

Phase Disaggregation

  • Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving (arXiv:2407.00079) [arXiv] [Code]

    • Moonshot AI & Tsinghua

    • Separate the prefill and decoding clusters; prediction-based early rejection (see the sketch below).
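
A toy sketch of prediction-based early rejection, with an invented capacity and load model: before admitting a request to prefill, predict the decoding pool's occupancy at the time the request would start decoding and reject it up front if the pool would be full. This is not Mooncake's actual policy.

```python
"""Prediction-based early rejection sketch: before admitting a request to the
prefill cluster, predict the decoding pool's occupancy at the time this request
would start decoding, and reject it up front if the pool would be full.
Capacity, timings, and the load model are invented."""
from dataclasses import dataclass

DECODE_CAPACITY = 4                  # concurrent decoding slots (hypothetical)

@dataclass
class Request:
    name: str
    prefill_time: float
    expected_decode_time: float

class AdmissionController:
    def __init__(self):
        self.decode_busy_until = []  # predicted finish times of admitted requests

    def predicted_decode_load(self, at_time):
        return sum(1 for t in self.decode_busy_until if t > at_time)

    def try_admit(self, req, now):
        decode_start = now + req.prefill_time
        if self.predicted_decode_load(decode_start) >= DECODE_CAPACITY:
            print(f"{req.name}: rejected early (decode pool predicted full)")
            return False
        self.decode_busy_until.append(decode_start + req.expected_decode_time)
        print(f"{req.name}: admitted")
        return True

controller = AdmissionController()
for i in range(6):
    controller.try_admit(
        Request(f"req{i}", prefill_time=1.0, expected_decode_time=10.0), now=0.0)
# The last two requests are rejected before any prefill work is wasted on them.
```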

  • Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads (arXiv:2401.11181) [arXiv]

    • ICT, CAS & Huawei Cloud

  • DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 2024) [Paper] [Code]

    • PKU & UCSD

  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 2024) [arXiv] [Blog]

    • UW & Microsoft

    • Best Paper Award

    • Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines (see the sketch below)
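
A minimal sketch of phase splitting, with simulated pools and a dictionary standing in for the KV cache: the prompt is prefilled on a machine from the prefill pool, the KV cache is handed off, and decoding continues on a machine from the decode pool. Worker names and the round-robin choice are placeholders; this is not Splitwise's (or DistServe's) system.

```python
"""Phase-splitting sketch: prefill runs on a machine from the prefill pool, the
KV cache is handed off, and decoding continues on a machine from the decode
pool. Pools, the handoff, and all names are simulated placeholders."""
import itertools

class Pool:
    """Round-robin over worker names (stand-in for real machine selection)."""
    def __init__(self, workers):
        self._workers = itertools.cycle(workers)

    def pick(self):
        return next(self._workers)

prefill_pool = Pool(["prefill-0", "prefill-1"])            # compute-heavy phase
decode_pool = Pool(["decode-0", "decode-1", "decode-2"])   # memory-bound phase

def serve(prompt_tokens, max_new_tokens):
    p_worker = prefill_pool.pick()
    kv_cache = {"tokens": list(prompt_tokens), "owner": p_worker}  # built by prefill
    print(f"prefill of {len(kv_cache['tokens'])} tokens on {p_worker}")

    d_worker = decode_pool.pick()
    kv_cache["owner"] = d_worker          # KV handoff (a network transfer in practice)
    print(f"KV cache transferred {p_worker} -> {d_worker}")

    for step in range(max_new_tokens):    # token-by-token decode on the decode machine
        kv_cache["tokens"].append(f"<gen{step}>")
    print(f"decoded {max_new_tokens} tokens on {d_worker}")

serve(prompt_tokens=range(512), max_new_tokens=4)
```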

LoRA Serving

  • dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving (OSDI 2024) [Paper]

    • PKU & Shanghai AI Lab

  • CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference (arXiv:2401.11240) [arXiv]

    • HKUST & CUHK-Shenzhen & Shanghai AI Lab & Huawei Cloud

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (MLSys 2024) [arXiv] [Code]

    • UC Berkeley

  • Punica: Multi-Tenant LoRA Serving (MLSys 2024) [arXiv] [Code]

    • UW & Duke

Speculative Decoding

  • Online Speculative Decoding (ICML 2024) [arXiv]

    • UC Berkeley & UCSD & Sisu Data & SJTU

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (ASPLOS 2024) [arXiv] [Code]

    • CMU

  • Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]

    • UC Berkeley & ICSI & LBNL

  • Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]

    • Google Research

Offloading

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]

    • Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU

    • High-throughput generative inference on a single GPU by offloading weights, activations, and the KV cache to CPU memory and disk (placement sketch below).
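
A toy sketch of a tiered offloading placement with invented capacities: fill GPU memory first, spill to CPU DRAM, then to disk. FlexGen actually searches over placement percentages with a cost model and schedules computation in a block order; none of that is reproduced here.

```python
"""Tiered offloading placement sketch: fill GPU memory first, then CPU DRAM,
then disk. Capacities and tensor sizes are invented; FlexGen searches over
placement percentages with a cost model instead of using this greedy rule."""

CAPACITY_GB = {"gpu": 16, "cpu": 64, "disk": 1000}   # hypothetical memory budget

def place(tensors):
    """tensors: {name: size_gb} -> {name: {tier: gb}} filled greedily by tier."""
    free = dict(CAPACITY_GB)
    placement = {}
    for name, size in tensors.items():
        placement[name] = {}
        for tier in ("gpu", "cpu", "disk"):
            take = min(size, free[tier])
            if take > 0:
                placement[name][tier] = take
                free[tier] -= take
                size -= take
            if size == 0:
                break
    return placement

# A large model in half precision plus a big-batch KV cache (sizes made up).
demand = {"weights": 60, "kv_cache": 40, "activations": 2}
for tensor, tiers in place(demand).items():
    print(tensor, "->", tiers)
```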

Heterogeneous Environment

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (ICML 2024) [Personal Notes] [arXiv] [Code]

    • HKUST & ETH & CMU

    • Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline-parallel stage can be assigned a different number of layers and a different tensor model parallel degree)

    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout (see the sketch below)
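
A compact sketch of an evolutionary search over an asymmetric layout: each pipeline stage gets a contiguous group of heterogeneous GPUs (used for tensor parallelism) and a contiguous chunk of layers, and the two partitions are evolved to minimize the bottleneck stage cost. The device speeds, cost model, and GA settings are invented; this is not HexGen's algorithm.

```python
"""Evolutionary layout search sketch: each pipeline stage gets a contiguous
group of heterogeneous GPUs (tensor parallelism within the group) and a
contiguous chunk of layers; mutate the two partitions to minimize the
bottleneck stage. Device speeds, cost model, and GA settings are invented."""
import random

random.seed(0)
DEVICE_SPEED = [1.0, 1.0, 0.5, 0.5, 0.5, 0.5]   # 2 fast + 4 slow GPUs (hypothetical)
NUM_LAYERS, NUM_STAGES = 32, 3
COMM_PENALTY = 0.2                               # per extra GPU in a TP group

def random_cuts(n, k):
    return sorted(random.sample(range(1, n), k - 1))

def stage_cost(num_layers, speeds):
    return num_layers / sum(speeds) + COMM_PENALTY * (len(speeds) - 1)

def fitness(candidate):
    dev_cuts, layer_cuts = candidate
    db = [0, *dev_cuts, len(DEVICE_SPEED)]
    lb = [0, *layer_cuts, NUM_LAYERS]
    return max(stage_cost(lb[i + 1] - lb[i], DEVICE_SPEED[db[i]:db[i + 1]])
               for i in range(NUM_STAGES))        # pipeline bottleneck stage

def mutate(candidate):
    dev_cuts, layer_cuts = candidate
    if random.random() < 0.5:
        return (random_cuts(len(DEVICE_SPEED), NUM_STAGES), layer_cuts)
    return (dev_cuts, random_cuts(NUM_LAYERS, NUM_STAGES))

population = [(random_cuts(len(DEVICE_SPEED), NUM_STAGES),
               random_cuts(NUM_LAYERS, NUM_STAGES)) for _ in range(16)]
for _ in range(200):                              # evolve: keep the fittest, mutate them
    population.sort(key=fitness)
    population = population[:8] + [mutate(random.choice(population[:8]))
                                   for _ in range(8)]

best = min(population, key=fitness)
print("device cuts:", best[0], "layer cuts:", best[1],
      "bottleneck cost:", round(fitness(best), 2))
```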

Fairness

LLM Alignment

  • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch (ATC 2024) [Paper]

    • THU

Acronyms

  • LLM: Large Language Model

  • LoRA: Low-Rank Adaptation
