OSDI 2025
Meta Info
Homepage: https://www.usenix.org/conference/osdi25
Paper list: https://www.usenix.org/conference/osdi25/technical-sessions
Acceptance Rate
14.7% (= 48 / 327)
Papers
Large Language Models (LLMs)
LLM Training
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training [Paper] [Video] [Slides] [Code]
UCSD & Meta
Imbalance across DP/PP workers → Input packing
Variable-length packing → Balance computation and communication latency.
Reorder all documents → Selectively delay long documents.
Imbalance across CP workers → Input sharding
Adaptively choose the CP sharding strategy with lower latency.
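A minimal sketch of the variable-length packing idea above: greedily place the longest remaining document onto the currently least-loaded worker so per-worker token counts stay balanced (illustrative only, not WLB-LLM's exact algorithm; document delaying and CP sharding are not modeled).

```python
import heapq

def pack_documents(doc_lengths, num_workers):
    """Assign documents (token counts) to workers so per-worker totals stay roughly equal."""
    heap = [(0, w) for w in range(num_workers)]          # (total_tokens, worker_id)
    heapq.heapify(heap)
    bins = [[] for _ in range(num_workers)]
    # Longest-first placement onto the least-loaded worker.
    for idx in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        load, w = heapq.heappop(heap)
        bins[w].append(idx)
        heapq.heappush(heap, (load + doc_lengths[idx], w))
    return bins

docs = [4096, 512, 2048, 1024, 8192, 256, 1024, 3072]
print(pack_documents(docs, num_workers=4))
```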
Understanding Stragglers in Large Model Training Using What-if Analysis [Paper] [Video] [Slides] [Artifact]
NYU & ByteDance Seed
Trace
3079 LLM pretraining jobs, collected from homogeneous clusters dedicated to LLM training.
Common causes of stragglers include PP stage partitioning imbalance, sequence length imbalance, and Python's garbage collection.
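A toy illustration of the what-if idea (an assumed simplification, not the paper's simulator): estimate the iteration time if a suspected straggler ran at the median speed of its peers, since a synchronous iteration finishes only when the slowest worker does.

```python
import statistics

# Per-worker step times (seconds) for one iteration of synchronous training.
step_times = {"worker0": 1.02, "worker1": 1.05, "worker2": 1.71, "worker3": 1.04}

actual = max(step_times.values())
whatif = dict(step_times, worker2=statistics.median(step_times.values()))
print(f"actual iteration time:        {actual:.2f}s")
print(f"what-if (worker2 de-straggled): {max(whatif.values()):.2f}s")
```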
Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks [Paper] [Video] [Slides] [Code]
UMich
TrainCheck targets objective correctness violations (e.g., incorrect API usage, buggy library implementations, faulty hardware).
Infer and check training invariants to prevent silent training errors.
Rule-level training invariants.
Example: The weights of certain layers should stay consistent across TP ranks.
Instrument a given DL training program to collect traces.
Define a set of generic relation templates; generate hypotheses from each template and validate them against the traces to produce invariants.
Results: Caught 18/20 real-world silent issues, identified 6 new bugs in DeepSpeed and Transformers.
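A minimal sketch of the invariant from the example above: check, over collected traces, that a layer's weights stay identical across TP ranks (hypothetical trace format and checksums, not TrainCheck's API).

```python
# traces[rank][step][layer] -> checksum of that layer's weights on that TP rank.
def check_tp_weight_consistency(traces, layers):
    ranks = sorted(traces)
    for step in traces[ranks[0]]:
        for layer in layers:
            vals = {traces[r][step][layer] for r in ranks}
            if len(vals) > 1:
                yield f"step {step}: '{layer}' diverges across TP ranks: {vals}"

traces = {
    0: {1: {"embed": 0xAB12, "norm": 0x77}, 2: {"embed": 0xAB13, "norm": 0x77}},
    1: {1: {"embed": 0xAB12, "norm": 0x77}, 2: {"embed": 0xAB99, "norm": 0x77}},
}
for violation in check_tp_weight_consistency(traces, ["embed", "norm"]):
    print(violation)
```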
LLM Inference
NanoFlow: Towards Optimal Large Language Model Serving Throughput [Paper] [Video] [Code]
UW
Split inputs into smaller nano-batches and duplicate operations so each portion is processed independently → overlap heterogeneous operations (e.g., compute-, memory-, and network-bound); see the sketch below.
Propose an auto-search to automatically construct an intra-device pipeline.
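A rough sketch of the nano-batch idea as a software-pipeline schedule: at each time step, different nano-batches occupy different resource-bound operations so they can overlap (hypothetical operation names; the real system overlaps these on GPU resources rather than merely reordering them).

```python
def split(batch, nano_size):
    return [batch[i:i + nano_size] for i in range(0, len(batch), nano_size)]

def pipeline_schedule(num_nano_batches, ops=("gemm", "kv_memory_read", "allreduce")):
    """At time step t, nano-batch i runs op (t - i), so compute, memory, and
    network work from different nano-batches proceed concurrently."""
    steps = []
    for t in range(num_nano_batches + len(ops) - 1):
        steps.append([(i, ops[t - i]) for i in range(num_nano_batches)
                      if 0 <= t - i < len(ops)])
    return steps

print(split(list(range(8)), nano_size=2))
for t, concurrent in enumerate(pipeline_schedule(4)):
    print(f"t={t}: {concurrent}")
```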
BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching [Paper] [Video] [Slides] [arXiv] [Code]
SJTU IPADS & Huawei Cloud
Objective: Optimize model loading to improve instance startup / autoscaling.
Two designs
Load parameters from remote rather than local cache → Network-based multicast scaling.
Employ a serial forwarding chain → parameter multicast is bulk sequential reading (i.e., bandwidth-bound), so more complicated multicast algorithms bring limited gains.
Fast-link-first greedy forwarding order → Prefer scale-up network (e.g., NVLink) over scale-out network (e.g., RDMA).
All-gather model shards by scale-up network to aggregate scale-out network bandwidth.
Existing model instances cooperate with newly scaled ones.
New GPU instances borrow parameters from old ones (i.e., loading).
Old GPU instances borrow computing power from new ones (i.e., multiplexing).
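A simplified sketch of the fast-link-first forwarding order, assuming a two-tier topology (GPUs grouped by node; NVLink within a node, RDMA across nodes). This is an illustration, not BlitzScale's actual planner.

```python
def forwarding_chain(source, targets, node_of):
    """Build a serial forwarding chain for one parameter shard, always
    preferring a next hop on the same node (scale-up) over a remote one (scale-out)."""
    chain, current, remaining = [source], source, set(targets)
    while remaining:
        same_node = [t for t in remaining if node_of[t] == node_of[current]]
        nxt = min(same_node or remaining)
        chain.append(nxt)
        remaining.discard(nxt)
        current = nxt
    return chain

node_of = {"gpu0": "n0", "gpu1": "n0", "gpu2": "n1", "gpu3": "n1"}
print(forwarding_chain("gpu0", ["gpu1", "gpu2", "gpu3"], node_of))
# -> ['gpu0', 'gpu1', 'gpu2', 'gpu3']: one RDMA hop, the rest over NVLink
```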
WaferLLM: Large Language Model Inference at Wafer Scale [Paper] [Slides] [Video] [Code]
Edinburgh & MSRA
Wafer-scale LLM parallelism
MeshGEMM: a scalable GEMM algorithm for wafer-scale devices to accelerate the prefill phase.
MeshGEMV: a scalable GEMV algorithm for wafer-scale devices to accelerate the decode phase.
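A toy illustration of a mesh-partitioned GEMV for the decode phase (NumPy, P×P logical mesh; a simplification for intuition, not the MeshGEMV algorithm itself).

```python
import numpy as np

def mesh_gemv(A, x, P):
    """y = A @ x with A block-partitioned over a P x P mesh of cores:
    core (i, j) holds block A[i, j] and the j-th slice of x; each core does a
    local GEMV, and partial results are reduced along every mesh row."""
    m, n = A.shape
    A_blocks = A.reshape(P, m // P, P, n // P).transpose(0, 2, 1, 3)  # (i, j, m/P, n/P)
    x_slices = x.reshape(P, n // P)
    partial = np.einsum("ijkl,jl->ijk", A_blocks, x_slices)           # local GEMVs
    return partial.sum(axis=1).reshape(m)                             # row-wise reduction

A, x = np.random.rand(8, 8), np.random.rand(8)
assert np.allclose(mesh_gemv(A, x, P=4), A @ x)
```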
Deep Learning Compilation
Performance Profiling
KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads [Paper] [Video] [Docs] [Artifact]
UCSD & Meta & GMU & OpenAI
Integrate profiling capabilities directly into the compiler workflow.
Two takeaways
Performance profiling tools require the compiler’s IR to provide fine-grained performance metrics.
Compiler optimization passes need programmable performance profiling tools to effectively guide their optimization decisions.
Integrated into the Triton infrastructure.
Code Generation
PipeThreader: Software-Defined Pipelining for Efficient DNN Execution [Paper] [Slides] [Video] [Code]
PKU & MSRA
Three designs
sEU: expose heterogeneous specialized execution units of modern AI accelerators.
sTask and sTask-graph: expose fine-grained pipeline parallelism at tile level.
Scheduling primitives: build efficient pipeline schedules.
Integrated into TileLang.
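A rough sketch of the sTask idea: tile-level tasks tagged with the specialized execution unit (sEU) they target, list-scheduled so that tasks from different tiles overlap on different sEUs (hypothetical data structures, not TileLang/PipeThreader's API).

```python
from dataclasses import dataclass

@dataclass
class STask:
    name: str      # e.g., "load_0"
    seu: str       # specialized execution unit: "tma", "tensor_core", ...
    deps: tuple    # names of sTasks that must finish first

def list_schedule(tasks):
    """Greedy list scheduling: at each step, run one ready sTask per sEU."""
    done, pending, timeline = set(), {t.name: t for t in tasks}, []
    while pending:
        ready = [t for t in pending.values() if all(d in done for d in t.deps)]
        step = {}
        for t in ready:
            step.setdefault(t.seu, t)          # one sTask per sEU per step
        timeline.append(step)
        for t in step.values():
            done.add(t.name)
            pending.pop(t.name)
    return timeline

tasks = [STask(f"load_{i}", "tma", ()) for i in range(2)] + \
        [STask(f"mma_{i}", "tensor_core", (f"load_{i}",)) for i in range(2)]
for step, assignment in enumerate(list_schedule(tasks)):
    print(step, {seu: t.name for seu, t in assignment.items()})
```

Note how step 1 runs `load_1` on the copy engine while `mma_0` runs on the tensor cores, which is the pipeline overlap the sTask-graph exposes.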
Mirage: A Multi-Level Superoptimizer for Tensor Programs [Paper] [Video] [arXiv] [Code]
CMU
µGraphs: a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.
Tensor program → µGraph candidates (via µGraph generator) → verified µGraph (via equivalence verifier) → GPU kernel (via µGraph optimizer)
µGraph generator: generate all possible µGraphs up to a bounded size using exhaustive search.
Equivalence verifier: check whether generated µGraphs are correct by random testing with theoretical guarantees.
µGraph optimizer: apply optimizations that don't affect the correctness of µGraphs (e.g., tensor layouts, memory planning, operator scheduling).
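A toy version of the random-testing step: run two candidate formulations of the same tensor program on random inputs and compare outputs (illustrative only; Mirage's verifier obtains probabilistic guarantees by testing over finite fields).

```python
import numpy as np

def probably_equivalent(prog_a, prog_b, shapes, trials=8):
    """Random testing: run both candidate lowerings on random inputs and compare."""
    for _ in range(trials):
        inputs = [np.random.randn(*s).astype(np.float32) for s in shapes]
        if not np.allclose(prog_a(*inputs), prog_b(*inputs), rtol=1e-4, atol=1e-5):
            return False
    return True

# Two formulations of the same computation (linear attention), differing in structure:
def formulation_a(Q, K, V):          # materializes the (seq x seq) score matrix
    return (Q @ K.T) @ V

def formulation_b(Q, K, V):          # reassociated; avoids the large intermediate
    return Q @ (K.T @ V)

print(probably_equivalent(formulation_a, formulation_b, [(4, 8), (4, 8), (4, 8)]))
```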
Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization [Paper] [Video] [Code]
UNIST
Reformulate the concepts of prior and posterior distributions in the Bayesian framework to the context of deep learning program optimization.
Search for optimal program code in a reduced search space through an iterative diffusion of program code.
Implemented in Ansor.
Transcompiler
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach [Paper] [Video]
USTC & Cambricon & ICT, CAS & ISCAS
Key insight: leverage the code-generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable.
Propose a transcompiler, QiMeng-Xpiler, to automatically translate tensor programs across deep learning systems (DLS) via both LLMs and symbolic program synthesis (i.e., neural-symbolic synthesis).
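A heavily simplified sketch of the neural-symbolic loop: an LLM proposes a translated program template with a small unresolved choice, and enumerative symbolic search plus random testing fills it in. All names here are placeholders, not QiMeng-Xpiler's interfaces.

```python
import numpy as np

def reference(x):                        # source-platform tensor program
    return np.maximum(x, 0.0) * 2.0

def llm_propose_translation():
    """Placeholder for the neural step: returns a template with a hole `alpha`."""
    return lambda x, alpha: np.maximum(x, 0.0) * alpha

def symbolic_repair(template, candidates, trials=16):
    """Enumerative synthesis over the hole, validated by random testing."""
    for alpha in candidates:
        xs = [np.random.randn(32).astype(np.float32) for _ in range(trials)]
        if all(np.allclose(template(x, alpha), reference(x)) for x in xs):
            return alpha
    return None

template = llm_propose_translation()
print("synthesized hole value:", symbolic_repair(template, candidates=[0.5, 1.0, 2.0]))
```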
GPU
GPU Kernel Profiling
Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing [Paper] [Video] [Code]
HKU
eBPF-inspired probe interface.
probe = snippet (i.e., assembly) + tracepoint (at the finest, instruction level) + map (thread-level: every thread saves, for value profiling; warp-level: only the warp-leader thread saves, for time profiling)
Virtualized probe execution model.
Directly place probes in the original assembly without protection.
Declare an independent register group logically at the assembly level.
Implementation
A hook driver (in C) to provide runtime support for assembly tracking, code caching, etc.
A probe engine (in Python) to instrument parallel assemblies.
A DSL compiler (in Python) to translate probes in platform-agnostic Python Tracing DSL into platform-specific assemblies (PTX for CUDA and GCNAsm for ROCm/HIP).
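A hypothetical illustration of the probe structure described above, written as plain Python dataclasses; this is not Neutrino's actual Tracing DSL syntax, and the PTX snippet is only indicative.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    snippet: str      # assembly spliced in at the tracepoint
    tracepoint: str   # instruction-level location, e.g. "before:ld.global"
    map_level: str    # "thread": every thread saves (value profiling)
                      # "warp": only the warp leader saves (time profiling)

gmem_latency_probe = Probe(
    snippet="mov.u64 %r_probe, %clock64;",
    tracepoint="before:ld.global",
    map_level="warp",
)
print(gmem_latency_probe)
```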
GPU Preemption
Preemptive Scheduling for Diverse XPUs using Multi-level Hardware Model [Paper] [Code] [Slides]
SJTU IPADS
XQueue: An XPU task is abstracted as a sequence of commands executed on a command queue.
Multi-level hardware model
Level-1: Preempt pending commands (block host CPU from launching new commands, no hardware requirements).
Level-2: Preempt in-flight commands (e.g., instruct the μ-controllers to stall command dispatching, leverage command programmability).
Level-3: Preempt running commands.
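A schematic sketch of the multi-level model over the XQueue abstraction (illustrative Python; the real levels act on host launch paths, device µ-controllers, and hardware preemption support).

```python
class XQueue:
    """An XPU task abstracted as a sequence of commands on a command queue."""
    def __init__(self, commands):
        self.pending = list(commands)     # not yet launched by the host
        self.in_flight = []               # launched, waiting in device queues
        self.running = None               # command currently executing
        self.launch_blocked = False

    def preempt(self, level):
        if level >= 1:                    # Level-1: block new launches (no HW support needed)
            self.launch_blocked = True
        if level >= 2:                    # Level-2: stall dispatch of in-flight commands
            print(f"stalling {len(self.in_flight)} in-flight command(s) via the µ-controller")
        if level >= 3:                    # Level-3: interrupt the running command
            print(f"interrupting running command: {self.running}")

q = XQueue(["cmd_a", "cmd_b", "cmd_c"])
q.running, q.in_flight, q.pending = "cmd_a", ["cmd_b"], ["cmd_c"]
q.preempt(level=2)
```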
GPU Communication
Resource Management
Resource Allocation
Decouple and Decompose: Scaling Resource Allocation with DeDe [Paper] [Video] [Code]
Harvard & UIUC
Decouple entangled resource and demand constraints and decompose the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel.
Released as a Python package.
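A toy alternating scheme in the spirit of decouple-and-decompose: alternate a cheap per-demand step (spread each demand's unmet requirement across resources) with a cheap per-resource step (scale allocations down to capacity). This is a simplification for intuition, not DeDe's actual formulation.

```python
import numpy as np

def allocate(demand_req, capacity, iters=50):
    """X[i, j] = amount of resource j allocated to demand i."""
    n_d, n_r = len(demand_req), len(capacity)
    X = np.zeros((n_d, n_r))
    for _ in range(iters):
        # Per-demand subproblem: each demand tops up its shortfall, split across resources.
        X += (demand_req - X.sum(axis=1))[:, None] / n_r
        X = np.maximum(X, 0.0)
        # Per-resource subproblem: each resource scales its column to respect capacity.
        X /= np.maximum(X.sum(axis=0) / capacity, 1.0)[None, :]
    return X

X = allocate(demand_req=np.array([4.0, 2.0, 3.0]), capacity=np.array([5.0, 5.0]))
print(np.round(X, 2))
```

Each subproblem touches only one row or one column, so the steps can run in parallel across demands and resources.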
Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling [Paper] [Video] [Slides]
Rutgers & MSR & Microsoft Azure
Objective: Manage VM allocation request latencies for Protean (Azure's VM allocator).
LatCache Scheduling
Key idea: Schedule requests where latency is minimized.
ExpectedTime = ProcessingTime + QueueingTime + RemainingTime
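A small sketch of the latency-driven, cache-aware placement rule above: route each request to the allocator agent with the lowest expected completion time, where processing is cheaper when that agent's cached state is warm for similar requests (hypothetical agent state, not Protean/Kamino internals).

```python
def expected_time(agent, request):
    processing = agent["proc_hit"] if request["kind"] in agent["warm_kinds"] else agent["proc_miss"]
    queueing = sum(agent["queued_proc_times"])   # requests already waiting at this agent
    remaining = agent["remaining_current"]       # time left on the in-progress request
    return processing + queueing + remaining

def place(agents, request):
    # LatCache: schedule the request where its expected latency is minimized.
    return min(agents, key=lambda a: expected_time(a, request))

agents = [
    {"name": "a1", "proc_hit": 0.2, "proc_miss": 1.5, "warm_kinds": {"general_purpose"},
     "queued_proc_times": [0.2, 0.2], "remaining_current": 0.1},
    {"name": "a2", "proc_hit": 0.2, "proc_miss": 1.5, "warm_kinds": set(),
     "queued_proc_times": [], "remaining_current": 0.0},
]
print(place(agents, {"kind": "general_purpose"})["name"])   # a1: cache hit beats an empty queue
```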
Cold Start
Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems [Paper] [Video] [Slides] [Trace]
Ant Group & THU & SJTU
Gaps for fork-based cold start: control-path latency (18-20ms) + resource contention latency (unstable) + user code initialization latency (10ms-1s).
AFaaS: Ant FaaS
Propose FRI (Function Runtime Interface) to shorten the control path.
Resource pooling and sharing to alleviate resource contention.
Seeding user code to reduce user code load and initialization.
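A bare-bones illustration of fork-based startup from a pre-seeded process (POSIX fork via Python, Linux only). The handler and state here are stand-ins for user code; the real system forks runtimes through the proposed FRI rather than a Python process.

```python
import os

# Seed process: run the expensive initialization (imports, model/state loading) once.
STATE = {"model_weights": [0.0] * 100_000}    # stand-in for costly user-code init

def handler(event):                           # stand-in for the user's function
    print(f"handled {event['payload']} with warm state of size {len(STATE['model_weights'])}")

def invoke(event):
    pid = os.fork()                           # cold start = fork the pre-initialized seed
    if pid == 0:                              # child inherits the seeded user code and state
        handler(event)
        os._exit(0)
    os.waitpid(pid, 0)

invoke({"payload": "hello"})
```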
Vector Search
Databases
Acronyms
DL: Deep Learning
DP: Data Parallelism
CP: Context Parallelism
PP: Pipeline Parallelism
TP: Tensor Parallelism
CXL: Compute Express Link