# OSDI 2025

## Meta Info

Homepage: <https://www.usenix.org/conference/osdi25>

Paper list: <https://www.usenix.org/conference/osdi25/technical-sessions>

### Acceptance Rate

14.7% (= 48 / 327)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training \[[Paper](https://www.usenix.org/conference/osdi25/presentation/wang-zheng)] \[[Video](https://www.youtube.com/watch?v=FJqRBCrY8Mg)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_wang_zheng.pdf)] \[[Code](https://github.com/Ash-Zheng/WLB-LLM-CP)]
    * UCSD & Meta
    * Imbalance across DP/PP workers → Input packing
      * Variable-length packing → Balance computation and communication latency.
      * Reorder all documents → Selectively delay long documents.
    * Imbalance across CP workers → Input sharding
      * Adaptively choose the CP sharding strategy with lower latency.
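    * A minimal sketch of the variable-length packing idea, assuming a greedy longest-first assignment to the least-loaded worker (the paper's packer also delays long documents and balances attention/communication cost, not just token counts):
      ```python
      # Toy variable-length packing: balance total tokens assigned to each worker.
      import heapq

      def pack_documents(doc_lens: list[int], num_workers: int) -> list[list[int]]:
          """Assign document indices (longest first) to the least-loaded worker."""
          heap = [(0, w) for w in range(num_workers)]          # (total tokens, worker id)
          heapq.heapify(heap)
          packs: list[list[int]] = [[] for _ in range(num_workers)]
          for doc in sorted(range(len(doc_lens)), key=lambda i: -doc_lens[i]):
              load, w = heapq.heappop(heap)
              packs[w].append(doc)
              heapq.heappush(heap, (load + doc_lens[doc], w))
          return packs

      print(pack_documents([4096, 512, 2048, 128, 1024, 3072], num_workers=2))
      ```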
  * ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization \[[Paper](https://www.usenix.org/conference/osdi25/presentation/wang-zhuang)] \[[Video](https://www.youtube.com/watch?v=G1spkA40_CM)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_wang_zhuang.pdf)]
    * Rice
    * Three techniques for balanced parallelism on GPUs.
      * Communication-oriented hash memory management.
      * Multiple hash functions in each GPU thread.
      * Hierarchical consistent hashing across GPUs.
  * Understanding Stragglers in Large Model Training Using What-if Analysis \[[Paper](https://www.usenix.org/conference/osdi25/presentation/lin-jinkun)] \[[Video](https://www.youtube.com/watch?v=bLr_5OuiUVc)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_lin_jinkun.pdf)] \[[Artifact](https://github.com/ByteDance-Seed/StragglerAnalysis)]
    * NYU & ByteDance Seed
    * Trace
      * 3079 LLM pretraining jobs, collected from *homogeneous* clusters dedicated to LLM training.
    * Common causes of stragglers include PP stage partitioning imbalance, sequence length imbalance, and Python's garbage collection.
  * Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks \[[Paper](https://www.usenix.org/conference/osdi25/presentation/jiang)] \[[Video](https://www.youtube.com/watch?v=Kf6P6RdGi5k)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_jiang_yuxuan.pdf)] \[[Code](https://github.com/OrderLab/TrainCheck)]
    * UMich
    * **TrainCheck** targets objective correctness violations (e.g., incorrect API usage, buggy library implementations, faulty hardware).
    * Infer and check *training invariants* to prevent silent training errors.
      * Rule-level training invariants.
        * Example: The weights of certain layers should stay consistent across TP ranks.
      * Instrument a given DL training program to collect traces.
      * Define a set of generic relation templates; for each template, generate hypotheses and validate them against the collected traces to produce invariants.
    * Results: Caught 18/20 real-world silent issues, identified 6 new bugs in DeepSpeed and Transformers.
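    * A toy version of the TP-rank weight-consistency invariant above (hypothetical layer names; TrainCheck infers and checks such invariants automatically from traces rather than hard-coding them):
      ```python
      # Toy invariant: weights that are replicated (not sharded) across TP ranks
      # must stay identical throughout training.
      import numpy as np

      def check_replicated_weights(per_rank_state: list[dict], replicated_keys: list[str]) -> None:
          reference = per_rank_state[0]
          for key in replicated_keys:
              for rank, state in enumerate(per_rank_state[1:], start=1):
                  if not np.allclose(reference[key], state[key]):
                      raise AssertionError(f"Invariant violated: {key} diverged on TP rank {rank}")

      ranks = [{"ln.weight": np.ones(4)}, {"ln.weight": np.ones(4)}]
      check_replicated_weights(ranks, replicated_keys=["ln.weight"])   # passes silently
      ```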
* LLM Inference
  * NanoFlow: Towards Optimal Large Language Model Serving Throughput \[[Paper](https://www.usenix.org/conference/osdi25/presentation/zhu-kan)] \[[Video](https://www.youtube.com/watch?v=Ph7ho4ILQf0)] \[[Code](https://github.com/efeslab/Nanoflow)]
    * UW
    * Split inputs into smaller *nano-batches* and duplicate operations so each portion is processed independently → overlap heterogeneous operations (e.g., compute-, memory-, and network-bound).
    * Propose *auto-search* to automatically construct an intra-device pipeline.
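    * A rough sketch of the nano-batch idea, using threads and stand-in stages to show the overlap (NanoFlow overlaps real compute-, memory-, and network-bound GPU kernels and derives the schedule via auto-search):
      ```python
      # Toy intra-device pipeline: split the batch into nano-batches so that the
      # (stand-in) memory-, compute-, and network-bound stages of different
      # nano-batches can overlap in time.
      from concurrent.futures import ThreadPoolExecutor
      import time

      def memory_stage(nb):   time.sleep(0.01); return nb    # stand-in: KV-cache reads
      def compute_stage(nb):  time.sleep(0.01); return nb    # stand-in: GEMMs
      def network_stage(nb):  time.sleep(0.01); return nb    # stand-in: collectives

      def run_pipeline(batch, num_nano_batches=4):
          nano_batches = [batch[i::num_nano_batches] for i in range(num_nano_batches)]
          with ThreadPoolExecutor(max_workers=3) as pool:
              mem = [pool.submit(memory_stage, nb) for nb in nano_batches]
              comp = [pool.submit(compute_stage, f.result()) for f in mem]
              net = [pool.submit(network_stage, f.result()) for f in comp]
              return [f.result() for f in net]

      print(run_pipeline(list(range(16))))
      ```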
  * BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching \[[Paper](https://www.usenix.org/conference/osdi25/presentation/zhang-dingyan)] \[[Video](https://www.youtube.com/watch?v=pS5ofuHOVes)] \[[Slides](https://www.usenix.org/system/files/osdi25_slides_zhang_dingyan.pdf)] \[[arXiv](https://arxiv.org/abs/2412.17246)] \[[Code](https://github.com/blitz-serving/blitz-scale)]
    * SJTU IPADS & Huawei Cloud
    * Objective: Optimize model loading to improve instance startup / autoscaling.
    * Two designs
      * Load parameters from remote GPUs rather than from a local host cache → network-based multicast scaling.
        * Employ a serial forwarding chain → parameter multicast is bulk sequential data transfer (i.e., *bandwidth-bound*), so more complex multicast algorithms yield limited gains.
        * Fast-link-first greedy forwarding order → Prefer scale-up network (e.g., NVLink) over scale-out network (e.g., RDMA).
        * All-gather model shards by scale-up network to aggregate scale-out network bandwidth.
      * Existing model instances cooperate with newly scaled ones.
        * New GPU instances borrow parameters from old ones (i.e., loading).
        * Old GPU instances borrow computing power from new ones (i.e., multiplexing).
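    * A simplified sketch of a fast-link-first forwarding order, assuming one RDMA hop per remote node followed by in-node NVLink fan-out (the real planner also all-gathers shards to aggregate scale-out bandwidth):
      ```python
      # Toy fast-link-first multicast chain: take the slow scale-out link once per
      # remote node, then forward within each node over the fast scale-up link.
      from collections import defaultdict

      def build_forwarding_chain(source: tuple[int, int], targets: list[tuple[int, int]]):
          """GPUs are (node_id, gpu_id); returns an ordered list of (src, dst, link)."""
          by_node = defaultdict(list)
          for node, gpu in targets:
              by_node[node].append((node, gpu))
          chain, head = [], source
          for node in sorted(by_node, key=lambda n: n != source[0]):   # same node first
              gpus = by_node[node]
              link = "NVLink" if node == head[0] else "RDMA"
              chain.append((head, gpus[0], link))
              for prev, nxt in zip(gpus, gpus[1:]):                    # in-node fan-out
                  chain.append((prev, nxt, "NVLink"))
              head = gpus[0]                                           # next RDMA hop starts here
          return chain

      print(build_forwarding_chain((0, 0), [(0, 1), (1, 0), (1, 1)]))
      ```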
  * WaferLLM: Large Language Model Inference at Wafer Scale \[[Paper](https://www.usenix.org/conference/osdi25/presentation/he)] \[[Slides](https://www.usenix.org/system/files/osdi25_slides_he.pdf)] \[[Code](https://github.com/MeshInfra/WaferLLM)]
    * Edinburgh & MSRA
    * Wafer-scale LLM parallelism
    * MeshGEMM: a scalable GEMM algorithm for wafer-scale devices to accelerate the prefill phase.
    * MeshGEMV: a scalable GEMV algorithm for wafer-scale devices to accelerate the decode phase.
* LLM Quantization
  * DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization \[[Paper](https://www.usenix.org/conference/osdi25/presentation/park-yeonhong)] \[[Video](https://www.youtube.com/watch?v=FEStC-7ZlJA)]
    * Seoul National University
    * Store the residual matrix (the difference between full-precision and quantized weights) in CPU memory, and dynamically fetch the residuals for only a small portion of the weights at inference time.
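    * A toy sketch of the dynamic residual correction; the 4-bit-style quantizer and the activation-magnitude channel selection here are illustrative assumptions, not DecDEC's actual kernels:
      ```python
      # Toy dynamic error compensation: quantized GEMV plus a residual correction
      # fetched for only a few "important" input channels (chosen by |activation|).
      import numpy as np

      rng = np.random.default_rng(0)
      W = rng.standard_normal((64, 128)).astype(np.float32)     # full-precision weights
      scale = np.abs(W).max() / 7.0
      W_q = np.round(W / scale).clip(-8, 7) * scale              # crude 4-bit-style quantization
      R = W - W_q                                                # residual, kept in CPU memory

      def decdec_like_gemv(x: np.ndarray, k: int = 8) -> np.ndarray:
          top = np.argsort(-np.abs(x))[:k]                       # channels worth correcting
          return W_q @ x + R[:, top] @ x[top]                    # fetch only k residual columns

      x = rng.standard_normal(128).astype(np.float32)
      print(np.linalg.norm(W @ x - decdec_like_gemv(x)), "<", np.linalg.norm(W @ x - W_q @ x))
      ```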

### Deep Learning Compilation

* Performance Profiling
  * KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads \[[Paper](https://www.usenix.org/conference/osdi25/presentation/guan)] \[[Video](https://www.youtube.com/watch?v=XfXWVgS7icE)] \[[Docs](https://triton-lang.org/main/dialects/ProtonOps.html)] \[[Artifact](https://github.com/ChandlerGuan/kperfir_artifact)]
    * UCSD & Meta & GMU & OpenAI
    * Integrate profiling capabilities directly into the compiler workflow.
    * Two takeaways
      * Performance profiling tools require the compiler’s IR to provide fine-grained performance metrics.
      * Compiler optimization passes need programmable performance profiling tools to effectively guide their optimization decisions.
    * Integrated into the Triton infrastructure.
* Code Generation
  * PipeThreader: Software-Defined Pipelining for Efficient DNN Execution \[[Paper](https://www.usenix.org/conference/osdi25/presentation/cheng)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_cheng_yu.pdf)] \[[Video](https://www.youtube.com/watch?v=ltHR_lS1QM8)] \[[Code](https://github.com/tile-ai/tilelang)]
    * PKU & MSRA
    * Three designs
      * sEU: expose heterogeneous specialized execution units of modern AI accelerators.
      * sTask and sTask-graph: expose fine-grained pipeline parallelism at tile level.
      * Scheduling primitives: build efficient pipeline schedules.
    * Integrated into TileLang.
  * Mirage: A Multi-Level Superoptimizer for Tensor Programs \[[Paper](https://www.usenix.org/conference/osdi25/presentation/wu-mengdi)] \[[Video](https://www.youtube.com/watch?v=CKrQsMHUh8M)] \[[arXiv](https://arxiv.org/abs/2405.05751)] \[[Code](https://github.com/mirage-project/mirage)]
    * CMU
    * µGraphs: a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.
    * Tensor program → µGraph candidates (via µGraph generator) → verified µGraph (via equivalence verifier) → GPU kernel (via µGraph optimizer)
      * µGraph generator: generate all possible µGraphs up to a bounded size using exhaustive search.
      * Equivalence verifier: check whether generated µGraphs are correct via random testing with a theoretical guarantee.
      * µGraph optimizer: apply optimizations that do not affect the correctness of µGraphs (e.g., tensor layouts, memory planning, operator scheduling).
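    * The overall flow as hedged pseudocode (function names are placeholders, not Mirage's API):
      ```python
      # Placeholder pipeline mirroring the generator → verifier → optimizer flow.
      def superoptimize(program, max_size, generate, is_equivalent, optimize, cost):
          """Return the cheapest verified and optimized candidate (or the original)."""
          best, best_cost = program, cost(program)
          for candidate in generate(program, max_size):   # generator: bounded exhaustive search
              if not is_equivalent(program, candidate):   # verifier: random testing
                  continue
              tuned = optimize(candidate)                 # optimizer: layouts, memory, scheduling
              if cost(tuned) < best_cost:
                  best, best_cost = tuned, cost(tuned)
          return best
      ```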
  * Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization \[[Paper](https://www.usenix.org/conference/osdi25/presentation/jeong)] \[[Video](https://www.youtube.com/watch?v=5ZL9B8JfGhs)] \[[Code](https://github.com/eai-lab/BayesianCodeDiffusion)]
    * UNIST
    * Reformulate the concepts of prior and posterior distributions in the Bayesian framework to the context of deep learning program optimization.
    * Search for optimal program code in a reduced search space through an iterative diffusion of program code.
    * Implemented in [Ansor](https://www.usenix.org/conference/osdi20/presentation/zheng).
* Transcompiler
  * QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach \[[Paper](https://www.usenix.org/conference/osdi25/presentation/dong)] \[[Video](https://www.youtube.com/watch?v=0MDa0YlFIzM)]
    * USTC & Cambricon & ICT, CAS & ISCAS
    * Key insight: leverage the code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable.
    * Propose a transcompiler, **QiMeng-Xpiler**, that automatically translates tensor programs across deep learning systems (DLS) via both LLMs and symbolic program synthesis (i.e., neural-symbolic synthesis).

### GPU

* GPU Kernel Profiling
  * Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing \[[Paper](https://www.usenix.org/conference/osdi25/presentation/huang-songlin)] \[[Video](https://www.youtube.com/watch?v=Gh1joy8fXww)] \[[Code](https://github.com/open-neutrino/neutrino)]
    * HKU
    * eBPF-inspired probe interface.
      * probe = snippet (i.e., assembly) + tracepoint (at the finest instruction level) + map
        * *Thread-level* map: every thread saves → for value profiling.
        * *Warp-level* map: only the warp leader thread saves → for time profiling.
    * Virtualized probe execution model.
      * Directly place probes in the original assembly without protection.
      * Declare an independent register group logically at the assembly level.
    * Implementation
      * A hook driver (in C) to provide runtime support for assembly tracking, code caching, etc.
      * A probe engine (in Python) to instrument parallel assemblies.
      * A DSL compiler (in Python) to translate probes in platform-agnostic Python Tracing DSL into platform-specific assemblies (PTX for CUDA and GCNAsm for ROCm/HIP).
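    * A hypothetical illustration of the probe abstraction (made-up names and snippets to show the snippet + tracepoint + map decomposition; this is not Neutrino's actual DSL or driver interface):
      ```python
      # Hypothetical probe description: an assembly snippet attached to a tracepoint,
      # writing into a per-thread or per-warp map.
      from dataclasses import dataclass

      @dataclass
      class Probe:
          tracepoint: str        # e.g., "after every global load instruction"
          snippet: str           # assembly to splice in at the tracepoint
          map_level: str         # "thread" (value profiling) or "warp" (time profiling)

      probes = [
          Probe("global_load", "mov.u64 %pr0, %clock64;", map_level="warp"),
          Probe("shared_store", "st.global.u32 [%map_ptr], %value;", map_level="thread"),
      ]
      for p in probes:
          print(f"instrument {p.tracepoint}: {p.snippet!r} -> {p.map_level}-level map")
      ```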
* GPU Preemption
  * Preemptive Scheduling for Diverse XPUs using Multi-level Hardware Model \[[Paper](https://www.usenix.org/conference/osdi25/presentation/shen-weihang)] \[[Code](https://github.com/XpuOS/xsched)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_shen_weihang.pdf)]
    * SJTU IPADS
    * XQueue: An XPU task is abstracted as a sequence of commands executed on a command queue.
    * Multi-level hardware model
      * Level-1: Preempt pending commands (block host CPU from launching new commands, no hardware requirements).
      * Level-2: Preempt in-flight commands (e.g., instruct the μ-controllers to stall command dispatching, leverage command programmability).
      * Level-3: Preempt running commands.
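    * A toy software-only illustration of Level-1 preemption, buffering commands in an XQueue-like object and pausing launches while preempted (XSched's real implementation hooks the driver-level command submission path):
      ```python
      # Toy Level-1 preemption: commands queue in software; while preempted, the host
      # stops launching them, so only already-submitted work keeps running.
      from collections import deque

      class ToyXQueue:
          def __init__(self, launch):
              self.pending = deque()
              self.launch = launch        # callable that submits a command to the device
              self.preempted = False

          def submit(self, cmd):
              self.pending.append(cmd)
              self._drain()

          def preempt(self):              # Level-1: block new launches, no HW support needed
              self.preempted = True

          def resume(self):
              self.preempted = False
              self._drain()

          def _drain(self):
              while self.pending and not self.preempted:
                  self.launch(self.pending.popleft())

      q = ToyXQueue(launch=print)
      q.submit("kernel_A"); q.preempt(); q.submit("kernel_B")   # kernel_B stays pending
      q.resume()                                                # now kernel_B launches
      ```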
* GPU Communication
  * Enabling Efficient GPU Communication over Multiple NICs with FuseLink \[[Paper](https://www.usenix.org/conference/osdi25/presentation/ren)] \[[Video](https://www.youtube.com/watch?v=SRkM8zMDyf8)]
    * HKUST iSING Lab
    * Integrate high-speed intra-server links as critical extensions of the inter-server network.
    * Implemented as an independent networking module that replaces the default InfiniBand networking in NCCL.

### Resource Management

* Resource Allocation
  * Decouple and Decompose: Scaling Resource Allocation with DeDe \[[Paper](https://www.usenix.org/conference/osdi25/presentation/xu)] \[[Video](https://www.youtube.com/watch?v=qHEQvMfNrTU)] \[[Code](https://github.com/illinois-nsai/dede)]
    * Harvard & UIUC
    * Decouple entangled resource and demand constraints and decompose the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel.
    * Released as a Python package.
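    * A toy flavor of the decouple-and-decompose idea, alternating independently solvable per-resource and per-demand steps (a heuristic illustration, not DeDe's actual optimization formulation):
      ```python
      # Toy alternating allocation: each resource scales down to respect capacity,
      # then each demand rescales to match its request; both steps are trivially
      # parallel across resources / demands.
      import numpy as np

      def allocate(demands, capacities, iters=20):
          m = len(capacities)
          X = np.repeat(demands[:, None] / m, m, axis=1)          # start: spread evenly
          for _ in range(iters):
              load = X.sum(axis=0)                                # per-resource step
              X = X * np.minimum(1.0, capacities / np.maximum(load, 1e-9))
              row = X.sum(axis=1)                                 # per-demand step
              X = X * (demands / np.maximum(row, 1e-9))[:, None]
          return X

      demands = np.array([4.0, 2.0, 3.0])
      capacities = np.array([6.0, 3.0])
      print(allocate(demands, capacities).round(2))
      ```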
  * Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling \[[Paper](https://www.usenix.org/conference/osdi25/presentation/domingo)] \[[Video](https://www.youtube.com/watch?v=KoN8KZPIelA)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_david_domingo.pdf)]
    * Rutgers & MSR & Microsoft Azure
    * Objective: Manage VM request latencies for [Protean](https://www.usenix.org/conference/osdi20/presentation/hadary).
    * LatCache Scheduling
      * Key idea: Schedule requests where latency is minimized.
      * ExpectedTime = ProcessingTime + QueueingTime + RemainingTime
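    * A minimal sketch of latency-driven selection using the formula above (field names are illustrative; estimating each term cache-awarely is Kamino's contribution):
      ```python
      # Pick the allocator queue where the request's expected completion time
      # (processing + queueing + remaining time of in-flight work) is smallest.
      from dataclasses import dataclass

      @dataclass
      class QueueState:
          name: str
          processing_time: float   # estimated time to place this request (cache-aware)
          queueing_time: float     # wait behind requests already queued here
          remaining_time: float    # leftover time of the request currently being served

      def pick_queue(queues: list[QueueState]) -> QueueState:
          return min(queues, key=lambda q: q.processing_time + q.queueing_time + q.remaining_time)

      queues = [QueueState("warm-cache", 2.0, 5.0, 1.0), QueueState("cold-cache", 6.0, 0.0, 0.0)]
      print(pick_queue(queues).name)   # -> cold-cache (lower expected time despite the cold cache)
      ```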
* Cold Start
  * Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems \[[Paper](https://www.usenix.org/conference/osdi25/presentation/chai-xiaohu)] \[[Video](https://www.youtube.com/watch?v=_Q6cXs34U8c)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_chai_xiaohu.pdf)] \[[Trace](https://github.com/antgroup/AFaaS)]
    * Ant Group & THU & SJTU
    * Gaps for fork-based cold start: control-path latency (18-20ms) + resource contention latency (unstable) + user code initialization latency (10ms-1s).
    * AFaaS: Ant FaaS
      * Propose FRI (Function Runtime Interface) to shorten the control path.
      * Resource pooling and sharing to alleviate resource contention.
      * Seeding user code to reduce user-code loading and initialization time.

### Vector Search

* Quake: Adaptive Indexing for Vector Search \[[Paper](https://www.usenix.org/conference/osdi25/presentation/mohoney)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_mohoney_jason.pdf)] \[[Video](https://www.youtube.com/watch?v=8YI6DkhwAgo)] \[[Code](https://github.com/marius-team/quake)]
  * UW-Madison

### Databases

* Tigon: A Distributed Database for a CXL Pod \[[Paper](https://www.usenix.org/conference/osdi25/presentation/huang-yibo)] \[[Slides](https://www.usenix.org/sites/default/files/conference/protected-files/osdi25_slides_huang_yibo.pdf)] \[[Video](https://www.youtube.com/watch?v=oJ2aS7l4Sto)] \[[Code](https://github.com/ut-datasys/tigon)]
  * UT-Austin
  * The first distributed transactional database for a CXL pod.

## Acronyms

* DL: Deep Learning
* DP: Data Parallelism
* CP: Context Parallelism
* PP: Pipeline Parallelism
* TP: Tensor Parallelism
* CXL: Compute Express Link

