# OSDI 2025

## Meta Info

Homepage: <https://www.usenix.org/conference/osdi25>

Paper list: <https://www.usenix.org/conference/osdi25/technical-sessions>

### Acceptance Rate

14.6% (= 48 / 327)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training \[[Paper](https://www.usenix.org/conference/osdi25/presentation/wang-zheng)] \[[Video](https://www.youtube.com/watch?v=FJqRBCrY8Mg)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_wang_zheng.pdf)] \[[Code](https://github.com/Ash-Zheng/WLB-LLM-CP)]
    * UCSD & Meta
    * Imbalance across DP/PP workers → Input packing
      * Variable-length packing → Balance computation and communication latency.
      * Reorder all documents → Selectively delay long documents.
    * Imbalance across CP workers → Input sharding
      * Adaptively choose the CP sharding strategy with lower latency.
  * ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization \[[Paper](https://www.usenix.org/conference/osdi25/presentation/wang-zhuang)] \[[Video](https://www.youtube.com/watch?v=G1spkA40_CM)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_wang_zhuang.pdf)]
    * Rice
    * Three hashing-based techniques for balanced sparse-data synchronization on GPUs.
      * Communication-oriented hash memory management.
      * Multiple hash functions in each GPU thread.
      * Hierarchical consistent hashing across GPUs.
  * Understanding Stragglers in Large Model Training Using What-if Analysis \[[Paper](https://www.usenix.org/conference/osdi25/presentation/lin-jinkun)] \[[Video](https://www.youtube.com/watch?v=bLr_5OuiUVc)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_lin_jinkun.pdf)] \[[Artifact](https://github.com/ByteDance-Seed/StragglerAnalysis)]
    * NYU & ByteDance Seed
    * Trace
      * 3079 LLM pretraining jobs, collected from *homogeneous* clusters dedicated to LLM training.
    * Common causes of stragglers include: PP stage partitioning imbalance, sequence length imbalance, Python’s garbage collection.
  * Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks \[[Paper](https://www.usenix.org/conference/osdi25/presentation/jiang)] \[[Video](https://www.youtube.com/watch?v=Kf6P6RdGi5k)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_jiang_yuxuan.pdf)] \[[Code](https://github.com/OrderLab/TrainCheck)]
    * UMich
    * **TrainCheck** targets objective correctness violations (e.g., incorrect API usage, buggy library implementations, faulty hardware).
    * Infer and check *training invariants* to prevent silent training errors.
      * Rule-level training invariants.
        * Example: The weights of certain layers should stay consistent across TP ranks.
      * Instrument a given DL training program to collect traces.
      * Define a set of generic relation templates, generate hypotheses from each template, and validate the hypotheses against the traces to produce invariants.
    * Results: Caught 18/20 real-world silent issues, identified 6 new bugs in DeepSpeed and Transformers.
* LLM Inference
  * NanoFlow: Towards Optimal Large Language Model Serving Throughput \[[Paper](https://www.usenix.org/conference/osdi25/presentation/zhu-kan)] \[[Video](https://www.youtube.com/watch?v=Ph7ho4ILQf0)] \[[Code](https://github.com/efeslab/Nanoflow)]
    * UW
    * Split inputs into smaller *nano-batches* and duplicate operations to operate on each portion independently, so that heterogeneous operations (e.g., compute, memory, network) can overlap.
    * Propose *auto-search* to automatically construct an intra-device pipeline.
  * BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching \[[Paper](https://www.usenix.org/conference/osdi25/presentation/zhang-dingyan)] \[[Video](https://www.youtube.com/watch?v=pS5ofuHOVes)] \[[Slides](https://www.usenix.org/system/files/osdi25_slides_zhang_dingyan.pdf)] \[[arXiv](https://arxiv.org/abs/2412.17246)] \[[Code](https://github.com/blitz-serving/blitz-scale)]
    * SJTU IPADS & Huawei Cloud
    * Objective: Optimize model loading to improve instance startup / autoscaling.
    * Two designs
      * Load parameters from remote rather than local cache → Network-based multicast scaling.
        * Employ a serial forwarding chain: parameter multicast is bulk sequential reading (i.e., *bandwidth-bound*), so more complicated multicast algorithms bring limited gains.
        * Fast-link-first greedy forwarding order → Prefer scale-up network (e.g., NVLink) over scale-out network (e.g., RDMA).
        * All-gather model shards by scale-up network to aggregate scale-out network bandwidth.
      * Existing model instances cooperate with newly scaled ones.
        * New GPU instances borrow parameters from old ones (i.e., loading).
        * Old GPU instances borrow computing power from new ones (i.e., multiplexing).
  * WaferLLM: Large Language Model Inference at Wafer Scale \[[Paper](https://www.usenix.org/conference/osdi25/presentation/he)] \[[Slides](https://www.usenix.org/system/files/osdi25_slides_he.pdf)] \[[Code](https://github.com/MeshInfra/WaferLLM)]
    * Edinburgh & MSRA
    * Wafer-scale LLM parallelism
    * MeshGEMM: a scalable GEMM algorithm for wafer-scale devices to accelerate the prefill phase.
    * MeshGEMV: a scalable GEMV algorithm for wafer-scale devices to accelerate the decode phase.
* LLM Quantization
  * DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization \[[Paper](https://www.usenix.org/conference/osdi25/presentation/park-yeonhong)] \[[Video](https://www.youtube.com/watch?v=FEStC-7ZlJA)]
    * Seoul National University
    * Store the residual matrix (the difference between full-precision and quantized weights) in CPU memory, and dynamically fetch the residuals for only a small portion of the weights.
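WLB-LLM's variable-length packing above can be illustrated with a greedy longest-first sketch; `balanced_pack` and its signature are illustrative simplifications, not the paper's implementation:

```python
def balanced_pack(doc_lens, num_workers):
    """Greedily assign variable-length documents to the currently
    least-loaded worker, longest documents first, so that per-worker
    token counts stay balanced."""
    bins = [[] for _ in range(num_workers)]
    loads = [0] * num_workers
    for length in sorted(doc_lens, reverse=True):
        w = loads.index(min(loads))  # least-loaded worker so far
        bins[w].append(length)
        loads[w] += length
    return bins, loads

bins, loads = balanced_pack([9, 7, 5, 3, 2, 2, 1, 1], num_workers=2)
print(loads)  # → [15, 15]: per-worker token counts are balanced
```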
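TrainCheck's example invariant (certain layer weights stay consistent across TP ranks) could be checked in spirit as follows; the data layout and function name are hypothetical, not TrainCheck's actual API:

```python
def check_weight_consistency(rank_weights, atol=0.0):
    """Report (rank, layer) pairs whose weights diverge from rank 0.
    `rank_weights` maps rank -> {layer_name: list of floats}."""
    violations = []
    ranks = sorted(rank_weights)
    ref = rank_weights[ranks[0]]  # rank 0 serves as the reference
    for rank in ranks[1:]:
        for layer, w in rank_weights[rank].items():
            if any(abs(a - b) > atol for a, b in zip(ref[layer], w)):
                violations.append((rank, layer))
    return violations

good = {0: {"ln.w": [1.0, 2.0]}, 1: {"ln.w": [1.0, 2.0]}}
bad = {0: {"ln.w": [1.0, 2.0]}, 1: {"ln.w": [1.0, 2.5]}}
print(check_weight_consistency(good))  # → []
print(check_weight_consistency(bad))   # → [(1, 'ln.w')]
```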
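NanoFlow's nano-batching can be sketched as splitting a batch and running stages per nano-batch; a real engine overlaps stage *s* on nano-batch *i* with stage *s−1* on nano-batch *i+1*, which this sequential sketch omits (names are illustrative):

```python
def split_nano_batches(requests, nano_size):
    """Split a batch into fixed-size nano-batches so heterogeneous
    operations on different nano-batches could overlap in a pipeline."""
    return [requests[i:i + nano_size]
            for i in range(0, len(requests), nano_size)]

def pipeline(requests, nano_size, stages):
    """Run each stage over each nano-batch sequentially (overlap elided)."""
    outs = []
    for nb in split_nano_batches(requests, nano_size):
        for stage in stages:
            nb = [stage(x) for x in nb]
        outs.extend(nb)
    return outs

print(pipeline(list(range(6)), 2, [lambda x: x + 1, lambda x: x * 2]))
# → [2, 4, 6, 8, 10, 12]
```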
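BlitzScale's fast-link-first forwarding order might look like this greedy sketch; the topology encoding and function name are illustrative assumptions:

```python
def forwarding_chain(source, targets, fast_links):
    """Build a serial parameter-forwarding chain: each node forwards to
    the next, preferring a target reachable over the fast scale-up
    network (e.g., NVLink) before falling back to scale-out (e.g., RDMA)."""
    chain, current = [source], source
    remaining = set(targets)
    while remaining:
        fast = [t for t in remaining if (current, t) in fast_links]
        nxt = min(fast) if fast else min(remaining)
        chain.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return chain

# Hypothetical topology: GPU 0 holds the weights; 0-1 and 2-3 share NVLink.
links = {(0, 1), (1, 0), (2, 3), (3, 2)}
print(forwarding_chain(0, [1, 2, 3], links))  # → [0, 1, 2, 3]
```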
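DecDEC's partial residual correction can be sketched as follows (a pure-Python toy; in the real system the residuals live in CPU memory and only the selected entries are fetched during GPU inference, and the selection criterion here is a simplification):

```python
def dequantize_with_residuals(q_weights, scale, residuals, top_k):
    """Reconstruct weights from a quantized vector plus a *partial*
    residual correction: add back only the top-k largest residuals."""
    approx = [q * scale for q in q_weights]
    # pick the k residual entries with the largest magnitude
    top = sorted(range(len(residuals)), key=lambda i: abs(residuals[i]),
                 reverse=True)[:top_k]
    for i in top:
        approx[i] += residuals[i]
    return approx

full = [0.9, -1.2, 0.1, 2.1]            # full-precision weights
q, s = [1, -1, 0, 2], 1.0               # quantized weights + scale
res = [full[i] - q[i] * s for i in range(4)]  # residual = full - quantized
print(dequantize_with_residuals(q, s, res, top_k=1))
```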

### Deep Learning Compilation

* Performance Profiling
  * KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads \[[Paper](https://www.usenix.org/conference/osdi25/presentation/guan)] \[[Video](https://www.youtube.com/watch?v=XfXWVgS7icE)] \[[Docs](https://triton-lang.org/main/dialects/ProtonOps.html)] \[[Artifact](https://github.com/ChandlerGuan/kperfir_artifact)]
    * UCSD & Meta & GMU & OpenAI
    * Integrate profiling capabilities directly into the compiler workflow.
    * Two takeaways
      * Performance profiling tools require the compiler’s IR to provide fine-grained performance metrics.
      * Compiler optimization passes need programmable performance profiling tools to effectively guide their optimization decisions.
    * Integrated into the Triton infrastructure.
* Code Generation
  * PipeThreader: Software-Defined Pipelining for Efficient DNN Execution \[[Paper](https://www.usenix.org/conference/osdi25/presentation/cheng)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_cheng_yu.pdf)] \[[Video](https://www.youtube.com/watch?v=ltHR_lS1QM8)] \[[Code](https://github.com/tile-ai/tilelang)]
    * PKU & MSRA
    * Three designs
      * sEU: expose heterogeneous specialized execution units of modern AI accelerators.
      * sTask and sTask-graph: expose fine-grained pipeline parallelism at tile level.
      * Scheduling primitives: build efficient pipeline schedules.
    * Integrated into TileLang.
  * Mirage: A Multi-Level Superoptimizer for Tensor Programs \[[Paper](https://www.usenix.org/conference/osdi25/presentation/wu-mengdi)] \[[Video](https://www.youtube.com/watch?v=CKrQsMHUh8M)] \[[arXiv](https://arxiv.org/abs/2405.05751)] \[[Code](https://github.com/mirage-project/mirage)]
    * CMU
    * µGraphs: a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.
    * Tensor program → µGraph candidates (via µGraph generator) → verified µGraph (via equivalence verifier) → GPU kernel (via µGraph optimizer)
      * µGraph generator: generate all possible µGraphs up to a bounded size using exhaustive search.
      * Equivalence verifier: check whether generated µGraphs are correct by random testing with theoretical guarantees.
      * µGraph optimizer: apply optimizations that don't affect the correctness of µGraphs (e.g., tensor layouts, memory planning, operator scheduling).
  * Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization \[[Paper](https://www.usenix.org/conference/osdi25/presentation/jeong)] \[[Video](https://www.youtube.com/watch?v=5ZL9B8JfGhs)] \[[Code](https://github.com/eai-lab/BayesianCodeDiffusion)]
    * UNIST
    * Reformulate the concepts of prior and posterior distributions in the Bayesian framework to the context of deep learning program optimization.
    * Search for optimal program code in a reduced search space through an iterative diffusion of program code.
    * Implemented in [Ansor](https://www.usenix.org/conference/osdi20/presentation/zheng).
* Transcompiler
  * QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach \[[Paper](https://www.usenix.org/conference/osdi25/presentation/dong)] \[[Video](https://www.youtube.com/watch?v=0MDa0YlFIzM)]
    * USTC & Cambricon & ICT, CAS & ISCAS
    * Key insight: leverage the code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable.
    * Propose a transcompiler called **QiMeng-Xpiler** to automatically translate tensor programs across deep learning systems (DLS) via both LLMs and symbolic program synthesis (i.e., neural-symbolic synthesis).
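Mirage's random-testing equivalence check can be conveyed with a small sketch; the actual verifier evaluates programs over a finite field to obtain its probabilistic guarantee, whereas this float-based toy only illustrates the idea:

```python
import random

def probably_equivalent(prog_a, prog_b, arity, trials=100, seed=0):
    """Compare two candidate programs on random inputs; return False on
    the first disagreement, True if all trials match."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-10, 10) for _ in range(arity)]
        if abs(prog_a(*xs) - prog_b(*xs)) > 1e-6:
            return False
    return True

# (a + b)^2 vs a^2 + 2ab + b^2: algebraically identical candidates
print(probably_equivalent(lambda a, b: (a + b) ** 2,
                          lambda a, b: a * a + 2 * a * b + b * b, 2))
# → True
```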

### GPU

* GPU Kernel Profiling
  * Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing \[[Paper](https://www.usenix.org/conference/osdi25/presentation/huang-songlin)] \[[Video](https://www.youtube.com/watch?v=Gh1joy8fXww)] \[[Code](https://github.com/open-neutrino/neutrino)]
    * HKU
    * eBPF-inspired probe interface.
      * probe = snippet (i.e., assembly) + tracepoint (at the finest, instruction level) + map
        * *Thread-level* map: every thread saves; used for value profiling.
        * *Warp-level* map: only the warp leader thread saves; used for time profiling.
    * Virtualized probe execution model.
      * Directly place probes in the original assembly without protection.
      * Declare an independent register group logically at the assembly level.
    * Implementation
      * A hook driver (in C) to provide runtime support for assembly tracking, code caching, etc.
      * A probe engine (in Python) to instrument parallel assemblies.
      * A DSL compiler (in Python) to translate probes in platform-agnostic Python Tracing DSL into platform-specific assemblies (PTX for CUDA and GCNAsm for ROCm/HIP).
* GPU Preemption
  * Preemptive Scheduling for Diverse XPUs using Multi-level Hardware Model \[[Paper](https://www.usenix.org/conference/osdi25/presentation/shen-weihang)] \[[Code](https://github.com/XpuOS/xsched)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_shen_weihang.pdf)]
    * SJTU IPADS
    * XQueue: An XPU task is abstracted as a sequence of commands executed on a command queue.
    * Multi-level hardware model
      * Level-1: Preempt pending commands (block host CPU from launching new commands, no hardware requirements).
      * Level-2: Preempt in-flight commands (e.g., instruct the μ-controllers to stall command dispatching, leverage command programmability).
      * Level-3: Preempt running commands.
* GPU Communication
  * Enabling Efficient GPU Communication over Multiple NICs with FuseLink \[[Paper](https://www.usenix.org/conference/osdi25/presentation/ren)] \[[Video](https://www.youtube.com/watch?v=SRkM8zMDyf8)]
    * HKUST iSING Lab
    * Integrate high-speed intra-server links as critical extensions of the inter-server network.
    * Implemented as an independent networking module to replace the default InfiniBand networking in NCCL.
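XSched's Level-1 preemption (simply stop launching *pending* commands to the device) can be sketched with a toy command queue; the class and method names are illustrative, not XSched's API:

```python
from collections import deque

class XQueue:
    """Toy XQueue: an XPU task is a sequence of commands on a queue.
    Level-1 preemption blocks further launches; no hardware support needed."""
    def __init__(self):
        self.pending = deque()
        self.launched = []       # commands already handed to the device
        self.preempted = False

    def submit(self, cmd):
        self.pending.append(cmd)
        self._drain()

    def _drain(self):
        while self.pending and not self.preempted:
            self.launched.append(self.pending.popleft())

    def preempt(self):           # Level-1: hold pending commands on the host
        self.preempted = True

    def resume(self):
        self.preempted = False
        self._drain()

q = XQueue()
q.submit("k1"); q.preempt(); q.submit("k2")
print(q.launched, list(q.pending))  # → ['k1'] ['k2']
q.resume()
print(q.launched)                   # → ['k1', 'k2']
```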

### Resource Management

* Resource Allocation
  * Decouple and Decompose: Scaling Resource Allocation with DeDe \[[Paper](https://www.usenix.org/conference/osdi25/presentation/xu)] \[[Video](https://www.youtube.com/watch?v=qHEQvMfNrTU)] \[[Code](https://github.com/illinois-nsai/dede)]
    * Harvard & UIUC
    * Decouple entangled resource and demand constraints and decompose the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel.
    * Released as a Python package.
  * Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling \[[Paper](https://www.usenix.org/conference/osdi25/presentation/domingo)] \[[Video](https://www.youtube.com/watch?v=KoN8KZPIelA)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_david_domingo.pdf)]
    * Rutgers & MSR & Microsoft Azure
    * Objective: Manage VM request latencies for [Protean](https://www.usenix.org/conference/osdi20/presentation/hadary).
    * LatCache Scheduling
      * Key idea: Schedule requests where latency is minimized.
      * ExpectedTime = ProcessingTime + QueueingTime + RemainingTime
* Cold Start
  * Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems \[[Paper](https://www.usenix.org/conference/osdi25/presentation/chai-xiaohu)] \[[Video](https://www.youtube.com/watch?v=_Q6cXs34U8c)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_chai_xiaohu.pdf)] \[[Trace](https://github.com/antgroup/AFaaS)]
    * Ant Group & THU & SJTU
    * Latency gaps for fork-based cold start: control-path latency (18–20 ms) + resource-contention latency (unstable) + user-code initialization latency (10 ms–1 s).
    * AFaaS: Ant FaaS
      * Propose FRI (Function Runtime Interface) to shorten the control path.
      * Resource pooling and sharing to alleviate resource contention.
      * Seeding user code to reduce user code load and initialization.
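DeDe's alternating per-demand / per-resource decomposition can be illustrated with a Sinkhorn-like toy: rows are scaled to meet each demand, then overloaded columns are scaled down to respect each capacity. This only mirrors the alternating structure, not DeDe's actual optimization formulation:

```python
def alternate_allocate(demand, capacity, iters=200):
    """Alternate between per-demand subproblems (scale each row to its
    demand) and per-resource subproblems (shrink overloaded columns to
    capacity); each family of subproblems is independent, so a real
    solver can process them in parallel."""
    n, m = len(demand), len(capacity)
    x = [[1.0] * m for _ in range(n)]
    for _ in range(iters):
        for i in range(n):                        # per-demand subproblem
            s = sum(x[i])
            x[i] = [v * demand[i] / s for v in x[i]]
        for j in range(m):                        # per-resource subproblem
            s = sum(x[i][j] for i in range(n))
            if s > capacity[j]:
                for i in range(n):
                    x[i][j] *= capacity[j] / s
    return x

x = alternate_allocate(demand=[4.0, 2.0], capacity=[3.0, 3.0])
print([round(sum(row), 2) for row in x])  # → [4.0, 2.0]
```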
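Kamino's LatCache rule (place a request where ExpectedTime = ProcessingTime + QueueingTime + RemainingTime is lowest) can be sketched as follows; the dictionary fields are hypothetical stand-ins for the three terms, not Kamino's actual cost model:

```python
def pick_scheduler(request_cost, schedulers):
    """Pick the scheduler instance with the lowest expected completion
    time for this request."""
    def expected_time(s):
        return (request_cost * s["cost_factor"]   # ProcessingTime
                + s["queue_len"] * s["avg_cost"]  # QueueingTime
                + s["remaining"])                 # RemainingTime of in-flight work
    return min(schedulers, key=expected_time)

scheds = [
    {"name": "busy", "cost_factor": 1.0, "queue_len": 4, "avg_cost": 2.0, "remaining": 1.0},
    {"name": "idle", "cost_factor": 3.0, "queue_len": 0, "avg_cost": 2.0, "remaining": 0.0},
]
# busy: 2*1 + 4*2 + 1 = 11; idle: 2*3 + 0 + 0 = 6
print(pick_scheduler(2.0, scheds)["name"])  # → idle
```

Note how a slower but idle instance can still win: queueing and remaining time dominate raw processing cost, which is the point of scheduling on latency rather than cache hit rate alone.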

### Vector Search

* Quake: Adaptive Indexing for Vector Search \[[Paper](https://www.usenix.org/conference/osdi25/presentation/mohoney)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_mohoney_jason.pdf)] \[[Video](https://www.youtube.com/watch?v=8YI6DkhwAgo)] \[[Code](https://github.com/marius-team/quake)]
  * UW-Madison

### Databases

* Tigon: A Distributed Database for a CXL Pod \[[Paper](https://www.usenix.org/conference/osdi25/presentation/huang-yibo)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_huang_yibo.pdf)] \[[Video](https://www.youtube.com/watch?v=oJ2aS7l4Sto)] \[[Code](https://github.com/ut-datasys/tigon)]
  * UT-Austin
  * The first distributed transactional database for a CXL pod.

## Acronyms

* DL: Deep Learning
* DP: Data Parallelism
* CP: Context Parallelism
* PP: Pipeline Parallelism
* TP: Tensor Parallelism
* CXL: Compute Express Link
