OSDI 2025

Meta Info

Homepage: https://www.usenix.org/conference/osdi25

Paper list: https://www.usenix.org/conference/osdi25/technical-sessions

Acceptance Rate

14.7% (= 48 / 327)

Papers

Large Language Models (LLMs)

  • LLM Training

    • WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training [Paper] [Video] [Slides] [Code]

      • UCSD & Meta

      • Imbalance across DP/PP workers → Input packing

        • Variable-length packing → Balance computation and communication latency (a packing sketch follows this entry).

        • Reorder all documents → Selectively delay long documents.

      • Imbalance across CP workers → Input sharding

        • Adaptively choose the CP sharding strategy with lower latency.
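
A minimal sketch of the variable-length packing idea above, assuming a simple greedy longest-first heuristic; this is an illustration, not WLB-LLM's actual packing algorithm.

```python
import heapq

def pack_documents(doc_lens, num_workers):
    """Greedy longest-first packing: always give the next-longest document to
    the worker with the fewest tokens so far, so token counts stay balanced."""
    heap = [(0, w, []) for w in range(num_workers)]  # (tokens_so_far, worker_id, doc_ids)
    heapq.heapify(heap)
    for doc_id, length in sorted(enumerate(doc_lens), key=lambda x: -x[1]):
        tokens, worker, docs = heapq.heappop(heap)
        docs.append(doc_id)
        heapq.heappush(heap, (tokens + length, worker, docs))
    return sorted(heap, key=lambda x: x[1])

if __name__ == "__main__":
    doc_lens = [8192, 512, 1024, 4096, 256, 2048, 128, 1024]
    for tokens, worker, docs in pack_documents(doc_lens, num_workers=2):
        print(f"worker {worker}: {tokens} tokens, docs {docs}")
```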

    • ZEN: Empowering Distributed Training with Sparsity-driven Data Synchronization [Paper] [Video] [Slides]

      • Rice

      • Three techniques for balanced parallelism on GPUs.

        • Communication-oriented hash memory management.

        • Multiple hash functions in each GPU thread.

        • Hierarchical consistent hashing across GPUs.
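
As a generic illustration (not ZEN's actual scheme), sparsity-driven synchronization can be pictured as shipping only the nonzero gradient entries, with indices hashed into buckets so each partition handles a balanced share.

```python
import numpy as np

def sparse_buckets(grad, num_buckets):
    """Generic illustration: hash the indices of nonzero gradient entries into
    buckets so communication/aggregation work can be spread evenly."""
    idx = np.flatnonzero(grad)                       # indices of nonzero gradients
    vals = grad[idx]
    bucket_of = (idx * 2654435761) % num_buckets     # simple multiplicative hash
    return {b: (idx[bucket_of == b], vals[bucket_of == b])
            for b in range(num_buckets)}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(1_000) * (rng.random(1_000) < 0.05)  # ~5% nonzeros
    for b, (idx, vals) in sparse_buckets(grad, num_buckets=4).items():
        print(f"bucket {b}: {len(idx)} entries")
```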

    • Understanding Stragglers in Large Model Training Using What-if Analysis [Paper] [Video] [Slides] [Artifact]

      • NYU & ByteDance Seed

      • Trace

        • 3079 LLM pretraining jobs, collected from homogeneous clusters dedicated to LLM training.

      • Common causes of stragglers include PP stage partitioning imbalance, sequence length imbalance, and Python’s garbage collection.
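
A toy sketch of the what-if idea, assuming synchronous training where iteration time is the max across ranks; the paper's trace-driven analysis is far more detailed, but the estimate below shows the shape of the question it answers.

```python
import statistics

def iteration_time(step_times):
    # Synchronous training: every rank waits for the slowest one.
    return max(step_times)

def what_if_no_straggler(step_times, rank):
    """Estimate the iteration time if `rank` behaved like a typical rank,
    i.e. replace its step time with the median of the other ranks."""
    others = [t for i, t in enumerate(step_times) if i != rank]
    fixed = list(step_times)
    fixed[rank] = statistics.median(others)
    return iteration_time(fixed)

if __name__ == "__main__":
    step_times = [1.02, 1.00, 1.45, 0.99]      # rank 2 is a straggler
    base = iteration_time(step_times)
    hypo = what_if_no_straggler(step_times, rank=2)
    print(f"measured {base:.2f}s, what-if {hypo:.2f}s, "
          f"estimated speedup {base / hypo:.2f}x")
```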

    • Training with Confidence: Catching Silent Errors in Deep Learning Training with Automated Proactive Checks [Paper] [Video] [Slides] [Code]

      • UMich

      • TrainCheck targets objective correctness violations (e.g., incorrect API usage, buggy library implementation, faulty hardware).

      • Infer and check training invariants to prevent silent training errors.

        • Rule-level training invariants.

          • Example: The weights of certain layers should stay consistent across TP ranks (a check for this invariant is sketched after this entry).

        • Instrument a given DL training program to collect traces.

        • Define a set of generic relation templates; generate hypotheses from each template and validate them against the collected traces to produce invariants.

      • Results: Caught 18/20 real-world silent issues and identified 6 new bugs in DeepSpeed and Transformers.
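
A minimal sketch of the example invariant above (weights replicated across TP ranks), assuming torch.distributed is already initialized and a tensor-parallel process group is available; TrainCheck infers such invariants automatically, whereas this hard-codes a single check.

```python
import torch
import torch.distributed as dist

def check_replicated_across_tp(param, tp_group, atol=0.0):
    """Invariant check: a parameter that is supposed to be replicated across
    tensor-parallel ranks must match (within atol) on all of them."""
    world = dist.get_world_size(tp_group)
    gathered = [torch.empty_like(param) for _ in range(world)]
    dist.all_gather(gathered, param.detach(), group=tp_group)
    reference = gathered[0]
    for rank, other in enumerate(gathered[1:], start=1):
        if not torch.allclose(reference, other, rtol=0.0, atol=atol):
            raise RuntimeError(
                f"Invariant violated: parameter diverged between TP rank 0 and rank {rank}")

# Example usage (inside a training loop, after an optimizer step):
#   check_replicated_across_tp(model.embed_tokens.weight, tp_group)
```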

  • LLM Inference

    • NanoFlow: Towards Optimal Large Language Model Serving Throughput [Paper] [Video] [Code]

      • UW

      • Split inputs into smaller nano-batches and duplicate operations so each portion is processed independently → overlap heterogeneous operations (compute-, memory-, and network-bound); sketched below.

      • Propose auto-search to automatically construct an intra-device pipeline.
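
A minimal sketch of the nano-batching structure, assuming PyTorch CUDA streams; it only overlaps identical GEMMs across chunks, whereas NanoFlow co-schedules heterogeneous operations with an auto-searched intra-device pipeline.

```python
import torch

def run_overlapped(x, weight, nano_batches=4):
    """Split the batch into nano-batches and issue each chunk on its own CUDA
    stream so the chunks' work can overlap on the device."""
    chunks = x.chunk(nano_batches, dim=0)
    streams = [torch.cuda.Stream() for _ in chunks]
    outputs = [None] * len(chunks)
    for i, (chunk, stream) in enumerate(zip(chunks, streams)):
        stream.wait_stream(torch.cuda.current_stream())  # make inputs visible to this stream
        with torch.cuda.stream(stream):
            outputs[i] = chunk @ weight                  # per-nano-batch compute
    torch.cuda.synchronize()                             # wait for all nano-batches
    return torch.cat(outputs, dim=0)

if __name__ == "__main__" and torch.cuda.is_available():
    x = torch.randn(4096, 4096, device="cuda")
    w = torch.randn(4096, 4096, device="cuda")
    print(run_overlapped(x, w).shape)
```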

    • BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching [Paper] [Video] [Slides] [arXiv] [Code]

      • SJTU IPADS & Huawei Cloud

      • Objective: Optimize model loading to improve instance startup / autoscaling.

      • Two designs

        • Load parameters from remote rather than local cache → Network-based multicast scaling.

          • Employ a serial forwarding chain → parameter multicast is bulk sequential data reading (i.e., bandwidth-bound), so more complicated multicast algorithms yield limited gains.

          • Fast-link-first greedy forwarding order → Prefer the scale-up network (e.g., NVLink) over the scale-out network (e.g., RDMA); chain construction sketched below.

          • All-gather model shards by scale-up network to aggregate scale-out network bandwidth.

        • Existing model instances cooperate with newly scaled ones.

          • New GPU instances borrow parameters from old ones (i.e., loading).

          • Old GPU instances borrow computing power from new ones (i.e., multiplexing).
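
A toy sketch of the fast-link-first forwarding order, with an assumed two-server NVLink topology; BlitzScale's actual planner also aggregates scale-out bandwidth via all-gather over the scale-up network, which this ignores.

```python
def forwarding_chain(source, targets, nvlink_peers):
    """Build a serial forwarding chain: from the current holder, always forward
    to a not-yet-served instance reachable over the scale-up link (NVLink) if
    one exists, otherwise fall back to the scale-out link."""
    chain, current, remaining = [source], source, set(targets)
    while remaining:
        fast = [t for t in remaining if t in nvlink_peers.get(current, set())]
        nxt = fast[0] if fast else next(iter(remaining))
        chain.append(nxt)          # `current` streams parameters to `nxt`
        remaining.remove(nxt)
        current = nxt
    return chain

if __name__ == "__main__":
    # Assumed topology: two 4-GPU servers; GPUs within a server are NVLink peers.
    servers = {f"s{si}g{gi}": si for si in range(2) for gi in range(4)}
    nvlink_peers = {g: {h for h, s in servers.items() if s == si and h != g}
                    for g, si in servers.items()}
    targets = [g for g in servers if g != "s0g0"]
    print(forwarding_chain("s0g0", targets, nvlink_peers))
```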

    • WaferLLM: Large Language Model Inference at Wafer Scale [Paper] [Slides] [Video] [Code]

      • Edinburgh & MSRA

      • Wafer-scale LLM parallelism

      • MeshGEMM: a scalable GEMM algorithm for wafer-scale devices to accelerate the prefill phase.

      • MeshGEMV: a scalable GEMV algorithm for wafer-scale devices to accelerate the decode phase.

  • LLM Quantization

    • DecDEC: A Systems Approach to Advancing Low-Bit LLM Quantization [Paper] [Video]

      • Seoul National University

      • Store the residual matrix (full-precision minus quantized weights) in CPU memory, and dynamically fetch the residuals for only a small portion of the weights (sketched below).
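
A toy sketch of the idea, assuming a symmetric int8 fake-quantization and correction of the top-k most salient input channels; the column selection and quantization scheme here are simplified assumptions, not DecDEC's actual kernel.

```python
import torch

def decdec_matvec(w_q, scale, residual_cpu, x, k=64):
    """Compute with the quantized weights, then correct only the k input
    channels with the largest activation magnitude using residuals fetched
    from CPU memory on the fly."""
    y = (w_q.float() * scale) @ x                       # low-bit path (dequantized here for simplicity)
    top = torch.topk(x.abs(), k).indices                # most salient input channels
    residual_cols = residual_cpu[:, top].to(x.device)   # fetch only k residual columns from CPU
    y += residual_cols @ x[top]                         # cheap correction term
    return y

if __name__ == "__main__":
    out_f, in_f = 1024, 4096
    w = torch.randn(out_f, in_f)
    scale = 0.1
    w_q = torch.clamp((w / scale).round(), -8, 7).to(torch.int8)  # fake 4-bit-style quantization
    residual = w - w_q.float() * scale                            # kept in CPU memory
    x = torch.randn(in_f)
    print(decdec_matvec(w_q, scale, residual, x).shape)
```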

Deep Learning Compilation

  • Performance Profiling

    • KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads [Paper] [Video] [Docs] [Artifact]

      • UCSD & Meta & GMU & OpenAI

      • Integrate profiling capabilities directly into the compiler workflow.

      • Two takeaways

        • Performance profiling tools require the compiler’s IR to provide fine-grained performance metrics.

        • Compiler optimization passes need programmable performance profiling tools to effectively guide their optimization decisions.

      • Integrated into the Triton infrastructure.

  • Code Generation

    • PipeThreader: Software-Defined Pipelining for Efficient DNN Execution [Paper] [Slides] [Video] [Code]

      • PKU & MSRA

      • Three designs

        • sEU: expose heterogeneous specialized execution units of modern AI accelerators.

        • sTask and sTask-graph: expose fine-grained pipeline parallelism at tile level.

        • Scheduling primitives: build efficient pipeline schedules.

      • Integrated into TileLang.

    • Mirage: A Multi-Level Superoptimizer for Tensor Programs [Paper] [Video] [arXiv] [Code]

      • CMU

      • µGraphs: a uniform representation of tensor programs at the kernel, thread block, and thread levels of the GPU compute hierarchy.

      • Tensor program → µGraph candidates (via µGraph generator) → verified µGraph (via equivalence verifier) → GPU kernel (via µGraph optimizer)

        • µGraph generator: generate all possible µGraphs up to a bounded size using exhaustive search.

        • Equivalence verifier: check whether generated µGraphs are correct via random testing with a theoretical guarantee (the random-testing idea is sketched below).

        • µGraph optimizer: apply optimizations that don't affect correctness of µGraphs (e.g., tensor layouts, memory planning, operator scheduling)
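
A toy version of the random-testing equivalence check, using matmul re-association as the "optimized" candidate; Mirage's verifier works over µGraphs and carries a theoretical guarantee that this sketch does not have.

```python
import torch

def probably_equivalent(ref_fn, candidate_fn, input_shapes, trials=16, rtol=1e-3, atol=1e-4):
    """Run both tensor programs on random inputs and accept the candidate only
    if outputs match on every trial."""
    for _ in range(trials):
        inputs = [torch.randn(*s) for s in input_shapes]
        if not torch.allclose(ref_fn(*inputs), candidate_fn(*inputs), rtol=rtol, atol=atol):
            return False
    return True

if __name__ == "__main__":
    def ref(a, b, c):
        return (a @ b) @ c

    def cand(a, b, c):          # a re-associated "optimized" candidate
        return a @ (b @ c)

    print(probably_equivalent(ref, cand, [(8, 16), (16, 32), (32, 4)]))
```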

    • Bayesian Code Diffusion for Efficient Automatic Deep Learning Program Optimization [Paper] [Video] [Code]

      • UNIST

      • Reformulate the Bayesian notions of prior and posterior distributions in the context of deep learning program optimization.

      • Search for optimal program code in a reduced search space through an iterative diffusion of program code.

      • Implemented in Ansor.

  • Transcompiler

    • QiMeng-Xpiler: Transcompiling Tensor Programs for Deep Learning Systems with a Neural-Symbolic Approach [Paper] [Video]

      • USTC & Cambricon & ICT, CAS & ISCAS

      • Key insight: leverage the code generation ability of LLMs to make costly search-based symbolic synthesis computationally tractable.

      • Propose a transcompiler, QiMeng-Xpiler, that automatically translates tensor programs across deep learning systems (DLS) via both LLMs and symbolic program synthesis (i.e., neural-symbolic synthesis).

GPU

  • GPU Kernel Profiling

    • Neutrino: Fine-grained GPU Kernel Profiling via Programmable Probing [Paper] [Video] [Code]

      • HKU

      • eBPF-inspired probe interface.

        • probe = snippet + tracepoint + map (a hypothetical probe structure is sketched after this entry)

          • snippet: the assembly to inject.

          • tracepoint: at the finest instruction level.

          • map: thread-level (every thread saves; for value profiling) or warp-level (only the warp leader thread saves; for time profiling).

      • Virtualized probe execution model.

        • Directly place probes in the original assembly without protection.

        • Declare an independent register group logically at the assembly level.

      • Implementation

        • A hook driver (in C) to provide runtime support for assembly tracking, code caching, etc.

        • A probe engine (in Python) to instrument parallel assemblies.

        • A DSL compiler (in Python) to translate probes in platform-agnostic Python Tracing DSL into platform-specific assemblies (PTX for CUDA and GCNAsm for ROCm/HIP).
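
Purely to make the probe structure concrete, here is a hypothetical mirror of "snippet + tracepoint + map" as a Python dataclass; the field names, tracepoint string, and PTX snippet are assumptions for illustration, not Neutrino's actual Tracing DSL.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class Probe:
    """Hypothetical mirror of the probe structure described above."""
    tracepoint: str                        # instruction (or instruction class) to instrument
    snippet: str                           # assembly injected at the tracepoint
    map_level: Literal["thread", "warp"]   # thread-level: value profiling; warp-level: time profiling

# Example (assumed syntax): a warp-level probe that samples the clock around global loads.
load_timing = Probe(
    tracepoint="ld.global",
    snippet="mov.u64 %out, %clock64;",
    map_level="warp",
)
```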

  • GPU Preemption

    • Preemptive Scheduling for Diverse XPUs using Multi-level Hardware Model [Paper] [Code] [Slides]

      • SJTU IPADS

      • XQueue: An XPU task is abstracted as a sequence of commands executed on a command queue.

      • Multi-level hardware model

        • Level-1: Preempt pending commands (block host CPU from launching new commands; no hardware requirements); sketched below.

        • Level-2: Preempt in-flight commands (e.g., instruct the μ-controllers to stall command dispatching, leverage command programmability).

        • Level-3: Preempt running commands.
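
A toy model of level-1 preemption on an XQueue-like abstraction: preemption simply stops the host from launching further pending commands, which needs no hardware support; in-flight and running commands would need levels 2 and 3. This is an illustration, not the paper's implementation.

```python
import queue
import threading
import time

class XQueue:
    """Toy XQueue: commands submitted by the host are dispatched to the device
    by a background thread; level-1 preemption just pauses that dispatch."""
    def __init__(self, launch_fn):
        self._pending = queue.Queue()
        self._running = threading.Event()
        self._running.set()
        self._launch_fn = launch_fn
        threading.Thread(target=self._dispatch_loop, daemon=True).start()

    def submit(self, cmd):
        self._pending.put(cmd)

    def preempt(self):              # level-1: stop launching pending commands
        self._running.clear()

    def resume(self):
        self._running.set()

    def _dispatch_loop(self):
        while True:
            cmd = self._pending.get()
            self._running.wait()    # hold pending commands while preempted
            self._launch_fn(cmd)    # stand-in for launching on the XPU

if __name__ == "__main__":
    q = XQueue(lambda c: print("launch", c))
    q.submit("kernel-A")
    time.sleep(0.05)                # let kernel-A launch
    q.preempt()
    q.submit("kernel-B")            # stays pending while preempted
    time.sleep(0.05)
    q.resume()                      # kernel-B launches now
    time.sleep(0.05)
```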

  • GPU Communication

    • Enabling Efficient GPU Communication over Multiple NICs with FuseLink [Paper] [Video]

      • HKUST iSING Lab

      • Integrate high-speed intra-server links as critical extensions of the inter-server network.

      • Implemented as an independent networking module to replace the default InfiniBand networking in NCCL.

Resource Management

  • Resource Allocation

    • Decouple and Decompose: Scaling Resource Allocation with DeDe [Paper] [Video] [Code]

      • Harvard & UIUC

      • Decouple entangled resource and demand constraints and decompose the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel (a toy alternation is sketched below).

      • Released as a Python package.
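
A toy alternation in the spirit of decouple-and-decompose, assuming demands that want resources in fixed proportions and resources with hard capacities; DeDe's actual method solves proper per-resource and per-demand optimization subproblems, which this simple projection scheme only gestures at.

```python
import numpy as np

def alternating_allocate(request, capacity, iters=100):
    """request[i, j]: demand i's ask for resource j.
    Per-resource step: each resource independently scales its column to fit capacity.
    Per-demand step: each demand independently re-balances so it consumes its
    resources in proportion to its request.
    Both steps decompose across resources / demands and can run in parallel."""
    alloc = request.astype(float).copy()
    for _ in range(iters):
        # Per-resource subproblems (one per column, independent).
        usage = alloc.sum(axis=0)
        alloc *= np.minimum(1.0, capacity / np.maximum(usage, 1e-9))
        # Per-demand subproblems (one per row, independent).
        with np.errstate(divide="ignore", invalid="ignore"):
            frac = np.where(request > 0, alloc / request, np.inf).min(axis=1)
        alloc = frac[:, None] * request
    return alloc

if __name__ == "__main__":
    request = np.array([[4.0, 2.0], [3.0, 6.0], [5.0, 1.0]])   # 3 demands x 2 resources
    capacity = np.array([8.0, 6.0])
    alloc = alternating_allocate(request, capacity)
    print(np.round(alloc, 2))
    print("usage:", np.round(alloc.sum(axis=0), 2), "capacity:", capacity)
```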

    • Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling [Paper] [Video] [Slides]

      • Rutgers & MSR & Microsoft Azure

      • Objective: Manage VM request latencies for Protean (Microsoft Azure’s VM allocation service).

      • LatCache Scheduling

        • Key idea: Schedule requests where latency is minimized.

        • ExpectedTime = ProcessingTime + QueueingTime + RemainingTime
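
A toy reading of the LatCache rule: send each request to the candidate with the smallest ExpectedTime. The candidate fields, cache model, and numbers below are illustrative assumptions, not Kamino's actual cost model.

```python
def pick_candidate(request, candidates):
    """Pick the candidate minimizing ExpectedTime = ProcessingTime +
    QueueingTime + RemainingTime, where ProcessingTime is lower on a cache hit."""
    def expected_time(c):
        processing = c["hit_time"] if request["image"] in c["cache"] else c["miss_time"]
        return processing + c["queueing"] + c["remaining"]
    return min(candidates, key=expected_time)

if __name__ == "__main__":
    request = {"image": "ubuntu-22.04"}
    candidates = [
        {"name": "A", "cache": {"ubuntu-22.04"}, "hit_time": 1.0, "miss_time": 6.0,
         "queueing": 3.0, "remaining": 0.0},
        {"name": "B", "cache": set(), "hit_time": 1.0, "miss_time": 6.0,
         "queueing": 0.5, "remaining": 0.0},
    ]
    print(pick_candidate(request, candidates)["name"])   # -> "A": cache hit outweighs queueing
```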

  • Cold Start

    • Fork in the Road: Reflections and Optimizations for Cold Start Latency in Production Serverless Systems [Paper] [Video] [Slides] [Trace]

      • Ant Group & THU & SJTU

      • Gaps for fork-based cold start: control-path latency (18-20ms) + resource contention latency (unstable) + user code initialization latency (10ms-1s).

      • AFaaS: Ant FaaS

        • Propose FRI (Function Runtime Interface) to shorten the control path.

        • Resource pooling and sharing to alleviate resource contention.

        • Seeding user code to reduce user-code loading and initialization time.

Databases

  • Tigon: A Distributed Database for a CXL Pod [Paper] [Slides] [Video] [Code]

    • UT-Austin

    • The first distributed transactional database for a CXL pod.

Acronyms

  • DL: Deep Learning

  • DP: Data Parallelism

  • CP: Context Parallelism

  • PP: Pipeline Parallelism

  • TP: Tensor Parallelism

  • CXL: Compute Express Link
