Large Language Model (LLM)

I am actively maintaining this list.

LLM Training

  • Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 2023) [Paper] [arXiv] [Code]

    • UMich SymbioticLab & AWS & PKU

  • Gemini: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 2023) [Paper]

    • Rice & AWS

  • Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs (NSDI 2023) [Paper] [Code]

    • UCLA & CMU & MSR & Princeton

    • Resilient distributed training

  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 2022) [Paper] [Code] [Docs]

    • UC Berkeley & AWS & Google & SJTU & CMU & Duke

    • Generalize the search through parallelism strategies.

LLM Inference

  • Efficiently Programming Large Language Models using SGLang (arXiv 2312.07104) [Personal Notes] [arXiv] [Code]

    • UC Berkeley & Stanford

    • Co-design the front-end programming interface and back-end serving runtime

    • SGLang; SGVM w/ RadixAttention

    • Reuse the KV cache across multiple calls and programs (see the sketch below)
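
    • A minimal sketch of the prefix-reuse idea, using invented names (PrefixKVCache, RadixNode) and simplifying the radix tree to a per-token trie; this is not SGLang's actual implementation:

      ```python
      # Sketch of prefix-based KV cache reuse (RadixAttention-style), simplified to a
      # per-token trie rather than a true radix tree. Names are illustrative only.
      from dataclasses import dataclass, field


      @dataclass
      class RadixNode:
          children: dict = field(default_factory=dict)   # token id -> RadixNode
          kv_handle: object = None                       # handle to cached KV for the prefix ending here


      class PrefixKVCache:
          def __init__(self):
              self.root = RadixNode()

          def match_prefix(self, tokens):
              """Return (num_cached_tokens, kv_handles) for the longest cached prefix."""
              node, handles = self.root, []
              for tok in tokens:
                  nxt = node.children.get(tok)
                  if nxt is None or nxt.kv_handle is None:
                      break
                  node, handles = nxt, handles + [nxt.kv_handle]
              return len(handles), handles

          def insert(self, tokens, kv_handles):
              """Record KV handles for every prefix of `tokens` so later calls can reuse them."""
              node = self.root
              for tok, kv in zip(tokens, kv_handles):
                  node = node.children.setdefault(tok, RadixNode())
                  node.kv_handle = kv


      # Two program calls sharing a prompt prefix: the second only prefills the suffix.
      cache = PrefixKVCache()
      prompt_a = [1, 2, 3, 4, 5]
      cache.insert(prompt_a, kv_handles=[f"kv{t}" for t in prompt_a])
      prompt_b = [1, 2, 3, 9, 9]
      cached, _ = cache.match_prefix(prompt_b)
      print(f"reuse {cached} tokens, prefill {len(prompt_b) - cached}")  # reuse 3, prefill 2
      ```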

  • PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv 2312.12456) [arXiv]

    • SJTU

    • A GPU-CPU hybrid inference engine

    • Hot-activated neurons are preloaded onto the GPU for fast access; cold-activated neurons are computed on the CPU (see the sketch below)
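
    • A minimal sketch of the hot/cold split described above, with made-up sizes and an arbitrary activation-frequency threshold; both halves run on the CPU here purely to show the partitioning logic, whereas a real engine would keep the hot weights GPU-resident:

      ```python
      # Sketch of a GPU-CPU hybrid FFN layer: frequently activated ("hot") neuron
      # weights are preloaded to the GPU, the rest stay on the CPU. The threshold
      # and sizes are placeholders, not PowerInfer's actual policy or API.
      import numpy as np

      rng = np.random.default_rng(0)
      hidden, ffn = 16, 64
      W = rng.standard_normal((ffn, hidden))          # rows = neurons of an FFN layer
      activation_freq = rng.random(ffn)               # offline-profiled activation frequency

      hot = activation_freq > 0.7                     # "hot" neurons preloaded to GPU memory
      W_gpu, W_cpu = W[hot], W[~hot]                  # stand-ins for GPU- and CPU-resident weights

      def hybrid_forward(x):
          # In a real engine the two matmuls run on different devices and are merged;
          # here both run on the CPU to illustrate the split.
          y = np.empty(ffn)
          y[hot] = W_gpu @ x                          # fast path: GPU-resident hot neurons
          y[~hot] = W_cpu @ x                         # slow path: CPU-resident cold neurons
          return np.maximum(y, 0.0)                   # ReLU keeps most cold outputs at zero

      print(hybrid_forward(rng.standard_normal(hidden)).shape)  # (64,)
      ```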

  • LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv 2312.11514) [arXiv]

    • Apple

  • Splitwise: Efficient Generative LLM Inference Using Phase Splitting (arXiv 2311.18677) [arXiv] [Blog]

    • UW & Microsoft

    • Split the two phases (i.e., prefill and decode) of an LLM inference request onto separate machines (see the sketch below)
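
    • A minimal sketch of the phase split, with invented PrefillWorker/DecodeWorker classes and an in-memory hand-off standing in for the cross-machine KV cache transfer:

      ```python
      # Sketch of prefill/decode disaggregation: one pool runs the compute-heavy
      # prefill phase, another runs the memory-bound decode phase, and the KV cache
      # is handed off in between. Class and function names are illustrative.
      class PrefillWorker:
          def prefill(self, request_id, prompt_tokens):
              # Stand-in for a forward pass over the prompt that builds the KV cache.
              return [("kv", t) for t in prompt_tokens]

      class DecodeWorker:
          def decode(self, request_id, kv_cache, max_new_tokens):
              # Stand-in for the token-by-token generation loop that extends the KV cache.
              out = []
              for _ in range(max_new_tokens):
                  next_token = len(kv_cache)          # dummy "model" output
                  kv_cache.append(("kv", next_token))
                  out.append(next_token)
              return out

      def serve(request_id, prompt_tokens, prefill_pool, decode_pool):
          kv = prefill_pool.prefill(request_id, prompt_tokens)          # phase 1: prefill machines
          # Splitwise transfers the KV cache over the network between machines;
          # here the hand-off is a plain Python object.
          return decode_pool.decode(request_id, kv, max_new_tokens=4)   # phase 2: decode machines

      print(serve("r0", [10, 11, 12], PrefillWorker(), DecodeWorker()))
      ```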

  • SpotServe: Serving Generative Large Language Models on Preemptible Instances (ASPLOS 2024) [Personal Notes] [arXiv] [Code]

    • CMU & PKU & CUHK

    • Distributed LLM serving system on preemptible/spot instances

    • Techniques

      • Dynamically adapt the LLM parallelization configuration

      • Minimize the cost of migrating instances for dynamic reparallelization

      • Stateful inference recovery

  • HexGen: Generative Inference of Foundation Model over Heterogeneous Decentralized Environment (arXiv 2311.11514) [Personal Notes] [arXiv] [Code]

    • HKUST & ETH & CMU

    • Support asymmetric tensor model parallelism and pipeline parallelism under the heterogeneous setting (i.e., each pipeline parallel stage can be assigned a different number of layers and a different tensor model parallel degree); see the sketch below

    • Propose a heuristic-based evolutionary algorithm to search for the optimal layout
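
    • A minimal sketch of what such an asymmetric layout looks like, using an invented StageConfig dataclass and a toy validity check rather than HexGen's actual configuration format or search algorithm:

      ```python
      # Sketch of an asymmetric parallel layout: each pipeline stage owns its own
      # number of layers and its own tensor-parallel degree.
      from dataclasses import dataclass

      @dataclass
      class StageConfig:
          num_layers: int      # layers assigned to this pipeline stage
          tp_degree: int       # tensor-model-parallel degree within the stage

      def validate_layout(stages, total_layers, total_gpus):
          assert sum(s.num_layers for s in stages) == total_layers, "layers must cover the model"
          assert sum(s.tp_degree for s in stages) <= total_gpus, "layout must fit the GPU budget"
          return True

      # A heterogeneous 4-stage layout: stronger GPUs take more layers at a higher TP degree.
      layout = [StageConfig(12, 4), StageConfig(8, 2), StageConfig(6, 1), StageConfig(6, 1)]
      print(validate_layout(layout, total_layers=32, total_gpus=8))
      ```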

  • S-LoRA: Serving Thousands of Concurrent LoRA Adapters (arXiv 2311.03285) [arXiv] [Code]

    • UC Berkeley

    • A system to serve many LoRA adapters

    • Store all adapters in the main memory and fetch the adapters used by the currently running queries to the GPU memory

    • Unified Paging — a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths (see the sketch below)

    • Employ a tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation

    • Built on top of LightLLM
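
    • A minimal sketch of the unified-paging idea, with an invented UnifiedPagePool class and arbitrary pool sizes; the real system hands out GPU memory pages holding tensors, not Python ids:

      ```python
      # Sketch of "Unified Paging": one fixed-size page pool backs both KV cache
      # tensors (varying sequence lengths) and LoRA adapter weights (varying ranks).
      class UnifiedPagePool:
          def __init__(self, num_pages):
              self.free_pages = list(range(num_pages))
              self.owner = {}                       # page id -> ("kv" | "adapter", object id)

          def alloc(self, kind, obj_id, num_pages):
              if len(self.free_pages) < num_pages:
                  raise MemoryError("pool exhausted; evict an adapter or preempt a request")
              pages = [self.free_pages.pop() for _ in range(num_pages)]
              for p in pages:
                  self.owner[p] = (kind, obj_id)
              return pages

          def free(self, pages):
              for p in pages:
                  del self.owner[p]
                  self.free_pages.append(p)

      pool = UnifiedPagePool(num_pages=8)
      kv_pages = pool.alloc("kv", "request-0", num_pages=3)            # KV cache of a running request
      lora_pages = pool.alloc("adapter", "lora-rank16", num_pages=2)   # a fetched adapter's weights
      print(sorted(kv_pages), sorted(lora_pages), len(pool.free_pages))
      ```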

  • Punica: Multi-Tenant LoRA Serving (arXiv 2310.18547) [arXiv] [Code]

    • UW & Duke

    • A system to serve multiple LoRA models in a shared GPU cluster

    • A CUDA kernel — Segmented Gather Matrix-Vector Multiplication (SGMV)

      • Batch GPU operations for concurrent execution of different LoRA models (see the sketch below)

      • A GPU only needs to store a single copy of the pre-trained model

    • A request scheduling mechanism to consolidate multi-tenant LoRA serving workloads

      • Route the new request to a small set of active GPUs

      • Allocate additional GPU resources when the existing GPUs are fully utilized

      • Periodically migrate existing requests for consolidation
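
    • A minimal numpy sketch of the segmented computation that SGMV implements; the real kernel fuses this into a single CUDA launch and adds the result to the shared base-model projection:

      ```python
      # Sketch of the SGMV idea: a batch of per-request hidden states is segmented by
      # which LoRA adapter each request uses, and each segment is multiplied by its
      # adapter's low-rank weights. numpy stands in for the actual CUDA kernel.
      import numpy as np

      rng = np.random.default_rng(0)
      hidden, rank, out_dim = 32, 8, 32
      adapters = [   # one (A, B) pair per LoRA adapter; the base weights are shared
          (rng.standard_normal((hidden, rank)), rng.standard_normal((rank, out_dim)))
          for _ in range(3)
      ]

      x = rng.standard_normal((6, hidden))       # 6 concurrent requests, one vector each
      seg_ids = np.array([0, 0, 1, 1, 1, 2])     # adapter index per request (sorted into segments)

      def sgmv(x, seg_ids, adapters):
          y = np.zeros((x.shape[0], out_dim))
          for a in np.unique(seg_ids):
              rows = seg_ids == a                # a segment of requests sharing adapter a
              A, B = adapters[a]
              y[rows] = (x[rows] @ A) @ B        # low-rank update for this segment, batched
          return y

      print(sgmv(x, seg_ids, adapters).shape)    # (6, 32); added to the shared base-model output
      ```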

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023) [Paper] [arXiv] [Code] [Homepage]

    • UC Berkeley & Stanford & UCSD

    • vLLM, PagedAttention

    • Partition the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens (see the sketch below)
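
    • A minimal sketch of the block-table bookkeeping, with an invented BlockManager and an arbitrary block size; vLLM additionally manages GPU memory and copy-on-write sharing on top of this mapping:

      ```python
      # Sketch of the block-based KV cache layout: a sequence's KV cache is split into
      # fixed-size blocks, and a per-sequence block table maps logical blocks to
      # physical blocks allocated on demand.
      BLOCK_SIZE = 4                              # tokens per KV block

      class BlockManager:
          def __init__(self, num_physical_blocks):
              self.free_blocks = list(range(num_physical_blocks))
              self.block_tables = {}              # sequence id -> list of physical block ids

          def append_token(self, seq_id, num_tokens_so_far):
              table = self.block_tables.setdefault(seq_id, [])
              if num_tokens_so_far % BLOCK_SIZE == 0:      # current block is full (or first token)
                  table.append(self.free_blocks.pop())     # allocate a new physical block lazily
              return table

          def free_sequence(self, seq_id):
              self.free_blocks.extend(self.block_tables.pop(seq_id, []))

      mgr = BlockManager(num_physical_blocks=16)
      for i in range(10):                          # a 10-token sequence needs ceil(10/4) = 3 blocks
          table = mgr.append_token("seq-0", i)
      print(table, len(mgr.free_blocks))           # [15, 14, 13] 13
      ```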

  • Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (ICML 2023) [Paper] [Code]

    • Rice & ZJU & Stanford & UCSD & ETH & Adobe & Meta AI & CMU

    • A system to predict contextual sparsity (small, input-dependent sets of attention heads and MLP parameters that yield approximately the same output) and exploit it at inference time (see the sketch below)
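
    • A minimal numpy sketch of exploiting contextual sparsity in a single FFN, where a random matrix stands in for the learned predictor and all sizes are arbitrary:

      ```python
      # Sketch of contextual sparsity: a cheap predictor scores the FFN neurons for the
      # current input, and only the top-k "likely active" neurons are actually computed.
      import numpy as np

      rng = np.random.default_rng(0)
      hidden, ffn, k = 16, 64, 8
      W1 = rng.standard_normal((ffn, hidden))
      W2 = rng.standard_normal((hidden, ffn))
      P = rng.standard_normal((ffn, hidden)) * 0.1   # stand-in for the learned sparsity predictor

      def sparse_ffn(x):
          scores = P @ x                              # predict which neurons matter for this input
          active = np.argsort(scores)[-k:]            # input-dependent set of neurons to compute
          h = np.maximum(W1[active] @ x, 0.0)         # compute only the selected rows
          return W2[:, active] @ h                    # and the matching columns of the output proj

      print(sparse_ffn(rng.standard_normal(hidden)).shape)  # (16,), with most neurons skipped
      ```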

  • FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 2023) [Personal Notes] [Paper] [Code]

    • Stanford & UC Berkeley & ETH & Yandex & HSE & Meta & CMU

    • High-throughput generative inference using only a single GPU.

  • Fast Distributed Inference Serving for Large Language Models (arXiv 2305.05920) [Paper]

    • PKU

    • Skip-join multi-level feedback queue scheduling instead of first-come-first-served (see the sketch below).

    • Proactive KV cache swapping.

    • Compared to Orca
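
    • A minimal sketch of the skip-join MLFQ idea, with invented quanta and a toy prefill-time estimate; the full scheduler also handles preemption and the proactive KV cache swapping noted above:

      ```python
      # Sketch of a skip-join multi-level feedback queue: rather than always entering
      # the highest-priority queue, a new request "skips" to the level whose quantum
      # matches its expected prefill time, then is demoted as it keeps running.
      from collections import deque

      QUANTA = [1, 2, 4, 8]                         # per-level time quanta (arbitrary units)

      class SkipJoinMLFQ:
          def __init__(self):
              self.queues = [deque() for _ in QUANTA]

          def admit(self, req_id, est_prefill_time):
              # Skip-join: start at the first level whose quantum covers the prefill time.
              level = next((i for i, q in enumerate(QUANTA) if est_prefill_time <= q),
                           len(QUANTA) - 1)
              self.queues[level].append((req_id, level))

          def schedule_step(self):
              for level, q in enumerate(self.queues):
                  if q:
                      req_id, lvl = q.popleft()
                      # Run the request for QUANTA[level]; if unfinished, demote it one level.
                      demoted = min(lvl + 1, len(QUANTA) - 1)
                      self.queues[demoted].append((req_id, demoted))
                      return req_id, level
              return None

      sched = SkipJoinMLFQ()
      sched.admit("short-prompt", est_prefill_time=1)   # enters level 0
      sched.admit("long-prompt", est_prefill_time=6)    # skips directly to the last level
      print(sched.schedule_step())                      # ('short-prompt', 0)
      ```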

  • AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 2023) [Paper] [Code]

    • UC Berkeley & PKU & UPenn & Stanford & Google

    • Trade off the overhead of model parallelism against the serving latency reduction from statistical multiplexing.

  • Efficiently Scaling Transformer Inference (MLSys 2023) [Paper]

    • Google

    • Outstanding Paper Award

    • Model partitioning; PaLM; TPUv4

  • DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 2022) [Paper] [Code] [Homepage]

    • Microsoft DeepSpeed

    • Leverage CPU/NVMe/GPU memory.

  • Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 2022) [Personal Notes] [Paper]

    • Seoul National University & FriendliAI

    • Iteration-level scheduling; selective batching (see the sketch below).
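
    • A minimal sketch of iteration-level scheduling with a dummy one-step "model" and arbitrary request lengths; selective batching (batching the non-attention operators across requests of different shapes) is only noted in a comment:

      ```python
      # Sketch of iteration-level scheduling: the batch is re-formed at every model
      # iteration, so finished requests exit immediately and waiting requests join
      # mid-flight instead of waiting for the whole batch to drain.
      from collections import deque

      MAX_BATCH = 3

      def run_one_iteration(request):
          # Stand-in for one decoding step of the model on this request.
          request["generated"] += 1
          return request["generated"] >= request["target_len"]   # True when finished

      waiting = deque({"id": i, "generated": 0, "target_len": n}
                      for i, n in enumerate([2, 5, 1, 3]))
      running, finished = [], []

      while waiting or running:
          # Admit new requests at iteration granularity, up to the batch limit.
          while waiting and len(running) < MAX_BATCH:
              running.append(waiting.popleft())
          # One iteration over the current batch (selective batching would fuse these steps).
          still_running = []
          for req in running:
              if run_one_iteration(req):
                  finished.append(req["id"])      # leaves the batch right away
              else:
                  still_running.append(req)
          running = still_running

      print(finished)   # [2, 0, 3, 1]: requests finish by length, not by arrival batch
      ```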

Speculative Decoding

  • Online Speculative Decoding (arXiv 2310.07177) [Paper]

    • UC Berkeley & UCSD & Sisu Data & SJTU

  • SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification (arXiv 2305.09781) [Paper] [Code]

    • CMU

  • Speculative Decoding with Big Little Decoder (NeurIPS 2023) [Paper]

    • UC Berkeley & ICSI & LBNL

  • Fast Inference from Transformers via Speculative Decoding (ICML 2023) [Paper]

    • Google Research
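
    • A minimal sketch of the draft-then-verify loop, with toy stand-ins for the draft and target models and a simplified greedy-agreement acceptance rule in place of the paper's rejection-sampling scheme:

      ```python
      # Generic sketch of speculative decoding: a cheap draft model proposes k tokens,
      # the target model checks them, and the longest agreeing prefix is accepted.
      def draft_model(prefix, k):
          # Placeholder draft model: proposes k next tokens given the prefix.
          return [(sum(prefix) + i) % 7 for i in range(1, k + 1)]

      def target_model_greedy(prefix):
          # Placeholder target model: returns its single greedy next token.
          return (sum(prefix) * 3 + 1) % 7

      def speculative_decode(prefix, steps=5, k=4):
          out = list(prefix)
          for _ in range(steps):
              proposal = draft_model(out, k)
              accepted = []
              for tok in proposal:
                  # In a real system all k positions are verified in one target forward pass.
                  if target_model_greedy(out + accepted) == tok:
                      accepted.append(tok)          # draft agreed with the target; keep it
                  else:
                      break
              # Always emit at least the target's own token, so progress is guaranteed.
              out += accepted + [target_model_greedy(out + accepted)]
          return out

      print(speculative_decode([1, 2, 3]))
      ```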

LLMs

  • Llama 2: Open Foundation and Fine-Tuned Chat Models (arXiv 2307.09288) [Paper] [Homepage]

    • Released under a permissive community license that allows commercial use.

  • LLaMA: Open and Efficient Foundation Language Models (arXiv 2302.13971) [Paper] [Code]

    • Meta AI

    • 6.7B, 13B, 32.5B, 65.2B

    • Open-access

  • PaLM: Scaling Language Modeling with Pathways (JMLR 2023) [Paper] [PaLM API]

    • 540B; open access to PaLM APIs in March 2023.

  • BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv 2211.05100) [Paper] [Model] [Blog]

    • 176B

    • Open-access

  • OPT: Open Pre-trained Transformer Language Models (arXiv 2205.01068) [Paper] [Code]

    • Meta AI

    • Range from 125M to 175B parameters.

    • Open-access

Acronyms

  • LLM: Large Language Model
