OSDI 2025
Meta Info
Homepage: https://www.usenix.org/conference/osdi25
Paper list: https://www.usenix.org/conference/osdi25/technical-sessions
Acceptance Rate
14.7% (= 48 / 327)
Papers
Large Language Models (LLMs)
LLM Training
WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training [Paper] [Video] [Slides] [Code]
UCSD & Meta
Imbalance across DP/PP workers → Input packing
Variable-length packing → Balance computation and communication latency.
Reorder all documents → Selectively delay long documents.
Imbalance across CP workers → Input sharding
Adaptively choose the CP sharding strategy with lower latency.
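A minimal sketch of the variable-length packing idea above: greedily place the longest remaining document onto the currently least-loaded worker so per-worker token counts stay balanced (illustrative only, not WLB-LLM's exact algorithm; document delaying and CP sharding are not modeled).

```python
import heapq

def pack_documents(doc_lengths, num_workers):
    """Assign documents (token counts) to workers so per-worker totals stay roughly equal."""
    heap = [(0, w) for w in range(num_workers)]          # (total_tokens, worker_id)
    heapq.heapify(heap)
    bins = [[] for _ in range(num_workers)]
    # Longest-first placement onto the least-loaded worker.
    for idx in sorted(range(len(doc_lengths)), key=lambda i: -doc_lengths[i]):
        load, w = heapq.heappop(heap)
        bins[w].append(idx)
        heapq.heappush(heap, (load + doc_lengths[idx], w))
    return bins

docs = [4096, 512, 2048, 1024, 8192, 256, 1024, 3072]
print(pack_documents(docs, num_workers=4))
```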
Understanding Stragglers in Large Model Training Using What-if Analysis [Paper] [Video] [Slides] [Artifact]
NYU & ByteDance Seed
Trace
3079 LLM pretraining jobs, collected from homogeneous clusters dedicated to LLM training.
Common causes of stragglers include PP stage partitioning imbalance, sequence length imbalance, and Python's garbage collection.
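A toy illustration of the what-if idea (an assumed simplification, not the paper's simulator): estimate the iteration time if a suspected straggler ran at the median speed of its peers, since a synchronous iteration finishes only when the slowest worker does.

```python
import statistics

# Per-worker step times (seconds) for one iteration of synchronous training.
step_times = {"worker0": 1.02, "worker1": 1.05, "worker2": 1.71, "worker3": 1.04}

actual = max(step_times.values())
whatif = dict(step_times, worker2=statistics.median(step_times.values()))
print(f"actual iteration time:        {actual:.2f}s")
print(f"what-if (worker2 de-straggled): {max(whatif.values()):.2f}s")
```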
Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks [Paper] [Video] [Slides] [Code]
UMich
TrainCheck targets objective correctness violations (e.g., incorrect API usage, buggy library implementations, faulty hardware).
Infer and check training invariants to prevent silent training errors.
Rule-level training invariants.
Example: The weights of certain layers should stay consistent across TP ranks.
Instrument a given DL training program to collect traces.
Define a set of generic relation templates; generate hypotheses from each template and validate them against the traces to produce invariants.
Results: Caught 18/20 real-world silent issues, identified 6 new bugs in DeepSpeed and Transformers.
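A minimal sketch of the invariant from the example above: check, over collected traces, that a layer's weights stay identical across TP ranks (hypothetical trace format and checksums, not TrainCheck's API).

```python
# traces[rank][step][layer] -> checksum of that layer's weights on that TP rank.
def check_tp_weight_consistency(traces, layers):
    ranks = sorted(traces)
    for step in traces[ranks[0]]:
        for layer in layers:
            vals = {traces[r][step][layer] for r in ranks}
            if len(vals) > 1:
                yield f"step {step}: '{layer}' diverges across TP ranks: {vals}"

traces = {
    0: {1: {"embed": 0xAB12, "norm": 0x77}, 2: {"embed": 0xAB13, "norm": 0x77}},
    1: {1: {"embed": 0xAB12, "norm": 0x77}, 2: {"embed": 0xAB99, "norm": 0x77}},
}
for violation in check_tp_weight_consistency(traces, ["embed", "norm"]):
    print(violation)
```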
LLM Inference
NanoFlow: Towards Optimal Large Language Model Serving Throughput [Paper] [Video] [Code]
UW
Split inputs into smaller nano-batches and duplicate operations so each portion is processed independently → overlap heterogeneous operations (e.g., compute-, memory-, and network-bound); see the sketch below.
Propose an auto-search to automatically construct an intra-device pipeline.
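A rough sketch of the nano-batch idea as a software-pipeline schedule: at each time step, different nano-batches occupy different resource-bound operations so they can overlap (hypothetical operation names; the real system overlaps these on GPU resources rather than merely reordering them).

```python
def split(batch, nano_size):
    return [batch[i:i + nano_size] for i in range(0, len(batch), nano_size)]

def pipeline_schedule(num_nano_batches, ops=("gemm", "kv_memory_read", "allreduce")):
    """At time step t, nano-batch i runs op (t - i), so compute, memory, and
    network work from different nano-batches proceed concurrently."""
    steps = []
    for t in range(num_nano_batches + len(ops) - 1):
        steps.append([(i, ops[t - i]) for i in range(num_nano_batches)
                      if 0 <= t - i < len(ops)])
    return steps

print(split(list(range(8)), nano_size=2))
for t, concurrent in enumerate(pipeline_schedule(4)):
    print(f"t={t}: {concurrent}")
```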
BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching [Paper] [Video] [Slides] [arXiv] [Code]
SJTU IPADS & Huawei Cloud
Objective: Optimize model loading to improve instance startup / autoscaling.
Two designs
Load parameters from remote rather than local cache → Network-based multicast scaling.
Employ a serial forwarding chain → parameter multicast is bulk sequential reading (i.e., bandwidth-bound), so more complicated multicast algorithms bring limited gains.
Fast-link-first greedy forwarding order → Prefer scale-up network (e.g., NVLink) over scale-out network (e.g., RDMA).
All-gather model shards by scale-up network to aggregate scale-out network bandwidth.
Existing model instances cooperate with newly scaled ones.
New GPU instances borrow parameters from old ones (i.e., loading).
Old GPU instances borrow computing power from new ones (i.e., multiplexing).
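A simplified sketch of the fast-link-first forwarding order, assuming a two-tier topology (GPUs grouped by node; NVLink within a node, RDMA across nodes). This is an illustration, not BlitzScale's actual planner.

```python
def forwarding_chain(source, targets, node_of):
    """Build a serial forwarding chain for one parameter shard, always
    preferring a next hop on the same node (scale-up) over a remote one (scale-out)."""
    chain, current, remaining = [source], source, set(targets)
    while remaining:
        same_node = [t for t in remaining if node_of[t] == node_of[current]]
        nxt = min(same_node or remaining)
        chain.append(nxt)
        remaining.discard(nxt)
        current = nxt
    return chain

node_of = {"gpu0": "n0", "gpu1": "n0", "gpu2": "n1", "gpu3": "n1"}
print(forwarding_chain("gpu0", ["gpu1", "gpu2", "gpu3"], node_of))
# -> ['gpu0', 'gpu1', 'gpu2', 'gpu3']: one RDMA hop, the rest over NVLink
```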
WaferLLM: Large Language Model Inference at Wafer Scale [Paper] [Slides] [Video] [Code]
Edinburgh & MSRA
Wafer-scale LLM parallelism
MeshGEMM: a scalable GEMM algorithm for wafer-scale devices to accelerate the prefill phase.
MeshGEMV: a scalable GEMV algorithm for wafer-scale devices to accelerate the decode phase.
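A toy illustration of a mesh-partitioned GEMV for the decode phase (NumPy, P×P logical mesh; a simplification for intuition, not the MeshGEMV algorithm itself).

```python
import numpy as np

def mesh_gemv(A, x, P):
    """y = A @ x with A block-partitioned over a P x P mesh of cores:
    core (i, j) holds block A[i, j] and the j-th slice of x; each core does a
    local GEMV, and partial results are reduced along every mesh row."""
    m, n = A.shape
    A_blocks = A.reshape(P, m // P, P, n // P).transpose(0, 2, 1, 3)  # (i, j, m/P, n/P)
    x_slices = x.reshape(P, n // P)
    partial = np.einsum("ijkl,jl->ijk", A_blocks, x_slices)           # local GEMVs
    return partial.sum(axis=1).reshape(m)                             # row-wise reduction

A, x = np.random.rand(8, 8), np.random.rand(8)
assert np.allclose(mesh_gemv(A, x, P=4), A @ x)
```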
Deep Learning Compilation
Performance Profiling
KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads [Paper] [Video] [Docs] [Artifact]
UCSD & Meta & GMU & OpenAI
Integrate profiling capabilities directly into the compiler workflow.
Two takeaways
Performance profiling tools require the compiler’s IR to provide fine-grained performance metrics.
Compiler optimization passes need programmable performance profiling tools to effectively guide their optimization decisions.
Integrated into the Triton infrastructure.
Code Generation
PipeThreader: Software-Defined Pipelining for Efficient DNN Execution [Paper] [Slides] [Video] [Code]
PKU & MSRA
Three designs
sEU: expose heterogeneous specialized execution units of modern AI accelerators.
sTask and sTask-graph: expose fine-grained pipeline parallelism at tile level.
Scheduling primitives: build efficient pipeline schedules.
Integrated into TileLang.
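A rough sketch of the sTask idea: tile-level tasks tagged with the specialized execution unit (sEU) they target, list-scheduled so that tasks from different tiles overlap on different sEUs (hypothetical data structures, not TileLang/PipeThreader's API).

```python
from dataclasses import dataclass

@dataclass
class STask:
    name: str      # e.g., "load_0"
    seu: str       # specialized execution unit: "tma", "tensor_core", ...
    deps: tuple    # names of sTasks that must finish first

def list_schedule(tasks):
    """Greedy list scheduling: at each step, run one ready sTask per sEU."""
    done, pending, timeline = set(), {t.name: t for t in tasks}, []
    while pending:
        ready = [t for t in pending.values() if all(d in done for d in t.deps)]
        step = {}
        for t in ready:
            step.setdefault(t.seu, t)          # one sTask per sEU per step
        timeline.append(step)
        for t in step.values():
            done.add(t.name)
            pending.pop(t.name)
    return timeline

tasks = [STask(f"load_{i}", "tma", ()) for i in range(2)] + \
        [STask(f"mma_{i}", "tensor_core", (f"load_{i}",)) for i in range(2)]
for step, assignment in enumerate(list_schedule(tasks)):
    print(step, {seu: t.name for seu, t in assignment.items()})
```

Note how step 1 runs `load_1` on the copy engine while `mma_0` runs on the tensor cores, which is the pipeline overlap the sTask-graph exposes.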
Mirage: A Multi-Level Superoptimizer for Tensor Programs [Paper] [Video] [arXiv] [Code]
CMU
µGraphs: a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.
Tensor program → µGraph candidates (via µGraph generator) → verified µGraph (via equivalence verifier) → GPU kernel (via µGraph optimizer)
µGraph generator: generate all possible µGraphs up to a bounded size using exhaustive search.
Equivalence verifier: check whether generated µGraphs are correct by random testing with theoretical guarantees.
µGraph optimizer: apply optimizations that don't affect the correctness of µGraphs (e.g., tensor layouts, memory planning, operator scheduling).
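A toy version of the random-testing step: run two candidate formulations of the same tensor program on random inputs and compare outputs (illustrative only; Mirage's verifier obtains probabilistic guarantees by testing over finite fields).

```python
import numpy as np

def probably_equivalent(prog_a, prog_b, shapes, trials=8):
    """Random testing: run both candidate lowerings on random inputs and compare."""
    for _ in range(trials):
        inputs = [np.random.randn(*s).astype(np.float32) for s in shapes]
        if not np.allclose(prog_a(*inputs), prog_b(*inputs), rtol=1e-4, atol=1e-5):
            return False
    return True

# Two formulations of the same computation (linear attention), differing in structure:
def formulation_a(Q, K, V):          # materializes the (seq x seq) score matrix
    return (Q @ K.T) @ V

def formulation_b(Q, K, V):          # reassociated; avoids the large intermediate
    return Q @ (K.T @ V)

print(probably_equivalent(formulation_a, formulation_b, [(4, 8), (4, 8), (4, 8)]))
```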
Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization [Paper] [Video] [Code]
UNIST
Reformulate the concepts of prior and posterior distributions in the Bayesian framework to the context of deep learning program optimization.
Search for optimal program code in a reduced search space through an iterative diffusion of program code.
Implemented in Ansor.
Transcompiler
QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach [Paper] [Video]
USTC & Cambricon & ICT, CAS & ISCAS
Key insight: leverage the code-generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable.
Propose a transcompiler, QiMeng-Xpiler, to automatically translate tensor programs across deep learning systems (DLS) via both LLMs and symbolic program synthesis (i.e., neural-symbolic synthesis).
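A heavily simplified sketch of the neural-symbolic loop: an LLM proposes a translated program template with a small unresolved choice, and enumerative symbolic search plus random testing fills it in. All names here are placeholders, not QiMeng-Xpiler's interfaces.

```python
import numpy as np

def reference(x):                        # source-platform tensor program
    return np.maximum(x, 0.0) * 2.0

def llm_propose_translation():
    """Placeholder for the neural step: returns a template with a hole `alpha`."""
    return lambda x, alpha: np.maximum(x, 0.0) * alpha

def symbolic_repair(template, candidates, trials=16):
    """Enumerative synthesis over the hole, validated by random testing."""
    for alpha in candidates:
        xs = [np.random.randn(32).astype(np.float32) for _ in range(trials)]
        if all(np.allclose(template(x, alpha), reference(x)) for x in xs):
            return alpha
    return None

template = llm_propose_translation()
print("synthesized hole value:", symbolic_repair(template, candidates=[0.5, 1.0, 2.0]))
```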
GPU
GPU Kernel Profiling
Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing [Paper] [Video] [Code]
HKU
eBPF-inspired probe interface.
probe = snippet (i.e., assembly) + tracepoint (at the finest, instruction level) + map (thread-level: every thread saves, for value profiling; warp-level: only the warp-leader thread saves, for time profiling)
Virtualized probe execution model.
Directly place probes in the original assembly without protection.
Declare an independent register group logically at the assembly level.
Implementation
A hook driver (in C) to provide runtime support for assembly tracking, code caching, etc.
A probe engine (in Python) to instrument parallel assemblies.
A DSL compiler (in Python) to translate probes in platform-agnostic Python Tracing DSL into platform-specific assemblies (PTX for CUDA and GCNAsm for ROCm/HIP).
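A hypothetical illustration of the probe structure described above, written as plain Python dataclasses; this is not Neutrino's actual Tracing DSL syntax, and the PTX snippet is only indicative.

```python
from dataclasses import dataclass

@dataclass
class Probe:
    snippet: str      # assembly spliced in at the tracepoint
    tracepoint: str   # instruction-level location, e.g. "before:ld.global"
    map_level: str    # "thread": every thread saves (value profiling)
                      # "warp": only the warp leader saves (time profiling)

gmem_latency_probe = Probe(
    snippet="mov.u64 %r_probe, %clock64;",
    tracepoint="before:ld.global",
    map_level="warp",
)
print(gmem_latency_probe)
```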
GPU Preemption
Preemptive Scheduling for Diverse XPUs using Multi-level Hardware Model [Paper] [Code] [Slides]
SJTU IPADS
XQueue: An XPU task is abstracted as a sequence of commands executed on a command queue.
Multi-level hardware model
Level-1: Preempt pending commands (block host CPU from launching new commands, no hardware requirements).
Level-2: Preempt in-flight commands (e.g., instruct the μ-controllers to stall command dispatching, leverage command programmability).
Level-3: Preempt running commands.
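A schematic sketch of the multi-level model over the XQueue abstraction (illustrative Python; the real levels act on host launch paths, device µ-controllers, and hardware preemption support).

```python
class XQueue:
    """An XPU task abstracted as a sequence of commands on a command queue."""
    def __init__(self, commands):
        self.pending = list(commands)     # not yet launched by the host
        self.in_flight = []               # launched, waiting in device queues
        self.running = None               # command currently executing
        self.launch_blocked = False

    def preempt(self, level):
        if level >= 1:                    # Level-1: block new launches (no HW support needed)
            self.launch_blocked = True
        if level >= 2:                    # Level-2: stall dispatch of in-flight commands
            print(f"stalling {len(self.in_flight)} in-flight command(s) via the µ-controller")
        if level >= 3:                    # Level-3: interrupt the running command
            print(f"interrupting running command: {self.running}")

q = XQueue(["cmd_a", "cmd_b", "cmd_c"])
q.running, q.in_flight, q.pending = "cmd_a", ["cmd_b"], ["cmd_c"]
q.preempt(level=2)
```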
GPU Communication
Resource Management
Resource Allocation
Decouple and Decompose: Scaling Resource Allocation with DeDe [Paper] [Video] [Code]
Harvard & UIUC
Decouple entangled resource and demand constraints and decompose the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel.
Released as a Python package.
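A toy alternating scheme in the spirit of decouple-and-decompose: alternate a cheap per-demand step (spread each demand's unmet requirement across resources) with a cheap per-resource step (scale allocations down to capacity). This is a simplification for intuition, not DeDe's actual formulation.

```python
import numpy as np

def allocate(demand_req, capacity, iters=50):
    """X[i, j] = amount of resource j allocated to demand i."""
    n_d, n_r = len(demand_req), len(capacity)
    X = np.zeros((n_d, n_r))
    for _ in range(iters):
        # Per-demand subproblem: each demand tops up its shortfall, split across resources.
        X += (demand_req - X.sum(axis=1))[:, None] / n_r
        X = np.maximum(X, 0.0)
        # Per-resource subproblem: each resource scales its column to respect capacity.
        X /= np.maximum(X.sum(axis=0) / capacity, 1.0)[None, :]
    return X

X = allocate(demand_req=np.array([4.0, 2.0, 3.0]), capacity=np.array([5.0, 5.0]))
print(np.round(X, 2))
```

Each subproblem touches only one row or one column, so the steps can run in parallel across demands and resources.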
Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling [Paper] [Video] [Slides]
Rutgers & MSR & Microsoft Azure
Objective: Manage VM allocation request latencies for Protean (Azure's VM allocator).
LatCache Scheduling
Key idea: Schedule requests where latency is minimized.
ExpectedTime = ProcessingTime + QueueingTime + RemainingTime
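A small sketch of the latency-driven, cache-aware placement rule above: route each request to the allocator agent with the lowest expected completion time, where processing is cheaper when that agent's cached state is warm for similar requests (hypothetical agent state, not Protean/Kamino internals).

```python
def expected_time(agent, request):
    processing = agent["proc_hit"] if request["kind"] in agent["warm_kinds"] else agent["proc_miss"]
    queueing = sum(agent["queued_proc_times"])   # requests already waiting at this agent
    remaining = agent["remaining_current"]       # time left on the in-progress request
    return processing + queueing + remaining

def place(agents, request):
    # LatCache: schedule the request where its expected latency is minimized.
    return min(agents, key=lambda a: expected_time(a, request))

agents = [
    {"name": "a1", "proc_hit": 0.2, "proc_miss": 1.5, "warm_kinds": {"general_purpose"},
     "queued_proc_times": [0.2, 0.2], "remaining_current": 0.1},
    {"name": "a2", "proc_hit": 0.2, "proc_miss": 1.5, "warm_kinds": set(),
     "queued_proc_times": [], "remaining_current": 0.0},
]
print(place(agents, {"kind": "general_purpose"})["name"])   # a1: cache hit beats an empty queue
```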
Cold Start
Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems [Paper] [Video] [Slides] [Trace]
Ant Group & THU & SJTU
Gaps for fork-based cold start: control-path latency (18-20ms) + resource contention latency (unstable) + user code initialization latency (10ms-1s).
AFaaS: Ant FaaS
Propose FRI (Function Runtime Interface) to shorten the control path.
Resource pooling and sharing to alleviate resource contention.
Seeding user code to reduce user code load and initialization.
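A bare-bones illustration of fork-based startup from a pre-seeded process (POSIX fork via Python, Linux only). The handler and state here are stand-ins for user code; the real system forks runtimes through the proposed FRI rather than a Python process.

```python
import os

# Seed process: run the expensive initialization (imports, model/state loading) once.
STATE = {"model_weights": [0.0] * 100_000}    # stand-in for costly user-code init

def handler(event):                           # stand-in for the user's function
    print(f"handled {event['payload']} with warm state of size {len(STATE['model_weights'])}")

def invoke(event):
    pid = os.fork()                           # cold start = fork the pre-initialized seed
    if pid == 0:                              # child inherits the seeded user code and state
        handler(event)
        os._exit(0)
    os.waitpid(pid, 0)

invoke({"payload": "hello"})
```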
Vector Search
Databases
Acronyms
DL: Deep Learning
DP: Data Parallelism
CP: Context Parallelism
PP: Pipeline Parallelism
TP: Tensor Parallelism
CXL: Compute Express Link