SOSP 2025

Meta Info

Homepage: https://sigops.org/s/conferences/sosp/2025/

Paper List

Acceptance Rate

17.7% (= 65 / 368)

Papers

LLM

  • LLM Training

    • Robust LLM Training Infrastructure at ByteDance [Paper]

      • HKU & ByteDance Seed

      • ByteRobust

    • Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training [Paper]

      • CUHK & ByteDance & ByteDance Seed

      • A lightweight distributed tracing and root cause analysis system.

      • Trace collective communication states and leverage internal control and data dependencies.

    • DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism [Paper]

      • HKU & AWS

      • Introduce fine-grained blockwise partitioning of both data and computation.

    • TrainVerify: Equivalence-Based Verification for Distributed LLM Training [Paper]

      • UMich & MSRA

      • Formally verify that a distributed parallel execution plan is mathematically equivalent to the logical specification.

      • Introduce a stage-wise parallel verification algorithm and shape-reduction techniques → Reduce complexity while preserving formal correctness.
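
A minimal sketch of the equivalence idea, using NumPy and deliberately tiny ("shape-reduced") tensors; it checks that a hypothetical two-way tensor-parallel matmul plan matches the logical specification, and is an illustration rather than TrainVerify's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Logical specification: Y = X @ W, on tiny stand-in shapes.
X = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 6))
logical = X @ W

# Parallel plan: split W column-wise across two "devices",
# compute partial results, then concatenate (an all-gather).
W0, W1 = np.hsplit(W, 2)
parallel = np.concatenate([X @ W0, X @ W1], axis=1)

# Equivalence check on the reduced shapes.
assert np.allclose(logical, parallel), "plan diverges from spec"
print("parallel plan matches the logical spec")
```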

  • LLM Inference

    • Jenga: Effective Memory Management for Serving LLM with Heterogeneity [Paper] [arXiv]

      • THU & Chicago & UC Berkeley

      • Two challenges

        • Recent models have heterogeneous embeddings with different sizes.

        • Some new architectures use only a subset of the prefix tokens to generate the next token.

      • Designs

        • Two-level memory allocator: choose the page size as the least common multiple (LCM) of the token embedding sizes (sketched below).

        • Enable attention variants to customize this mechanism by precisely specifying the exact prefix subset.
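
A minimal sketch of the LCM-based page sizing mentioned above, with assumed embedding sizes; it only illustrates why an LCM-sized page can be carved into whole embeddings for every layer type, not Jenga's actual allocator.

```python
from math import lcm

# Hypothetical per-layer embedding sizes in bytes (assumed values).
embedding_sizes = [128, 192, 256]

# A page sized at the LCM is divisible by every embedding size,
# so each page can be carved into whole entries with no waste.
page_size = lcm(*embedding_sizes)  # 768 for these sizes
print(f"page size = {page_size} bytes")

for size in embedding_sizes:
    assert page_size % size == 0
    print(f"embedding size {size:4d} -> {page_size // size} slots per page")
```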

    • PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications [Paper] [arXiv]

      • Chicago & THU & LinkedIn & UC Berkeley

      • Hybrid prefilling: Prefill non-attention layers chunk-by-chunk, but prefill the attention layers normally.

      • Suffix KV cache discarding / offloading: Discard the useless KV cache.

      • Continuous JCT calibration: Continuously re-estimate the JCT of each request based on which requests were previously scheduled, then schedule the single request with the lowest JCT (see the sketch below).
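
A minimal sketch of lowest-JCT scheduling with continuous recalibration, under an assumed linear cost model (the estimator and numbers are illustrative, not PrefillOnly's):

```python
def estimate_jct(request_len: int, queued_work: float) -> float:
    # Assumed linear cost model: a request completes after the already
    # queued work drains plus its own prefill time.
    return queued_work + 0.01 * request_len

def schedule(pending: list[int]) -> list[int]:
    order, queued_work = [], 0.0
    while pending:
        # Recalibrate every pending request's JCT against the current
        # queue, then dispatch just the single lowest-JCT request.
        best = min(pending, key=lambda r: estimate_jct(r, queued_work))
        pending.remove(best)
        queued_work += 0.01 * best
        order.append(best)
    return order

print(schedule([4096, 512, 2048]))  # shortest-first under this model
```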

    • IC-Cache: Efficient Large Language Model Serving via In-context Caching [Paper]

      • UIUC & Google

      • Leverage historical request-response pairs from larger models as in-context examples.
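
A toy sketch of the in-context caching idea (the similarity measure and prompt format are assumptions, not IC-Cache's design): similar historical request/response pairs from a larger model are prepended as few-shot examples for a smaller model.

```python
# (request, large-model response) history
cache = [
    ("translate 'bonjour' to English", "hello"),
    ("translate 'merci' to English", "thank you"),
]

def similar(query: str, k: int = 2):
    # Toy similarity: shared-word overlap; a real system would use
    # embeddings or another learned retriever.
    words = set(query.split())
    return sorted(cache, key=lambda p: -len(words & set(p[0].split())))[:k]

def build_prompt(query: str) -> str:
    # Prepend retrieved pairs as in-context examples for a smaller model.
    examples = "\n".join(f"Q: {q}\nA: {a}" for q, a in similar(query))
    return f"{examples}\nQ: {query}\nA:"

print(build_prompt("translate 'adieu' to English"))
```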

    • Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market [Paper]

      • PKU & Alibaba Cloud

      • Schedule requests across multiple models and make auto-scaling decisions on a per-token basis to maximize service quality.

      • Reduce auto-scaling overhead through component reuse, explicit memory management, and fine-grained KV cache synchronization.

    • Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference [Paper]

      • SJTU IPADS & THU & SenseTime

  • LLM Applications

    • Pie: A Programmable Serving System for Emerging LLM Applications [Paper]

      • Yale

      • Decompose the traditional generation loop into fine-grained service handlers exposed via an API, and delegate control of the generation process to user-provided programs called inferlets.

      • Enable applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without modifying the serving system.
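
A hypothetical inferlet-style program (all names below are illustrative, not Pie's actual API); the point is that the application, not the serving system, drives the generation loop and can embed bespoke logic:

```python
class Engine:
    """Stand-in for the serving system's fine-grained handlers."""
    def prefill(self, prompt: str) -> list[str]:
        return prompt.split()                 # toy context / "KV cache"
    def decode_one(self, ctx: list[str]) -> str:
        return f"tok{len(ctx)}"               # toy next-token handler

def inferlet(engine: Engine, prompt: str, max_tokens: int) -> str:
    # The application owns the loop: it can stop early, edit the
    # context, or interleave I/O between decode steps.
    ctx = engine.prefill(prompt)
    out = []
    for _ in range(max_tokens):
        tok = engine.decode_one(ctx)
        ctx.append(tok)                       # app-managed context policy
        out.append(tok)
        if tok.endswith("3"):                 # bespoke stop condition
            break
    return " ".join(out)

print(inferlet(Engine(), "hello world", 8))   # -> "tok2 tok3"
```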

  • RAG Systems

    • METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation [Paper] [arXiv]

      • Chicago & Princeton & MSR

      • Jointly schedule queries and adapt the key RAG configurations of each query (e.g., the number of retrieved text chunks, synthesis methods).

    • HeteRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows [Paper] [arXiv] [Artifact]

      • UCSD

      • RAGraph, a graph-based abstraction → Expose optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness.

  • KV Cache Management

    • DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [Paper]

      • Huawei & CUHK & SJTU

      • Exploit three levels of differentiation in the KV cache (sketched after this list):

        • The differing impact of keys and values on attention computation.

        • The varying importance of tokens.

        • The diverse dynamic sparsity patterns across attention heads.

      • An on-GPU memory manager → Compact the fragmented free-memory list into contiguous regions in parallel.
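
A toy sketch of the differentiated-precision idea (the bit widths, importance scores, and quantizer are assumptions, not DiffKV's kernels): keys are kept at higher precision than values, and important tokens at higher precision than unimportant ones.

```python
import numpy as np

def quantize(x: np.ndarray, bits: int) -> np.ndarray:
    # Uniform symmetric quantization to `bits` bits (toy scheme).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
keys = rng.standard_normal((16, 64))
values = rng.standard_normal((16, 64))
importance = rng.random(16)          # assumed per-token importance scores

for t in range(16):
    # Keys get more bits than values; important tokens get more bits.
    k_bits, v_bits = (8, 4) if importance[t] > 0.5 else (4, 2)
    keys[t] = quantize(keys[t], k_bits)
    values[t] = quantize(values[t], v_bits)
```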

  • Multi-GPU Operator Optimization

    • Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling [Paper] [Artifact]

      • UCSD & Meta

      • A multi-GPU operator compiler based on a loop-based intermediate representation, CommIR.

      • Treat remote GPU memory as an explicitly managed extension of the memory hierarchy.

      • Automatically reproduce the performance of hand-optimized baselines like RingAttention and Ulysses.

MoE

  • MoE Inference

    • KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models [Paper] [Code]

      • THU & Approaching.AI

      • Employ optimized, AMX-specialized kernels to fully utilize the computational capabilities of modern CPUs and incorporate an asynchronous CPU-GPU task scheduling mechanism to minimize overhead.

      • Expert Deferral → Overlap CPU and GPU computations.
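
A minimal sketch of overlapping CPU and GPU expert computation (stand-in functions, not KTransformers' scheduler): the CPU-routed experts are launched asynchronously so the GPU-side experts run concurrently instead of waiting.

```python
from concurrent.futures import ThreadPoolExecutor

def run_gpu_experts(tokens):   # stand-in for the GPU expert kernels
    return [t * 2 for t in tokens]

def run_cpu_experts(tokens):   # stand-in for AMX-optimized CPU kernels
    return [t + 1 for t in tokens]

tokens_gpu, tokens_cpu = [1, 2, 3], [4, 5]
with ThreadPoolExecutor() as pool:
    cpu_future = pool.submit(run_cpu_experts, tokens_cpu)  # deferred to CPU
    gpu_out = run_gpu_experts(tokens_gpu)                  # overlaps CPU work
    cpu_out = cpu_future.result()                          # join before merge
print(gpu_out + cpu_out)                                   # [2, 4, 6, 5, 6]
```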

Distributed Training

  • Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters [Paper]

    • ETH & MIT & HES-SO

    • Combine an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework → Support different types of heterogeneity to optimize training throughput and cost.

Deep Learning Compilation

  • Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs [Paper]

    • ICL

    • Temporal relationships: a tensor at one timestep may depend on tensors from earlier or later timesteps.

    • Construct a symbolic dependence graph → Concisely encode dynamic dependencies between operators, and apply whole-program optimizations.
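
A toy sketch of a symbolic dependence graph (illustrative, not Tempo's IR): each edge records, as an offset over a symbolic timestep t, which step of a producer an operator reads, so a whole-program pass can reason about lifetimes without unrolling the recurrence.

```python
# op -> list of (producer, timestep offset read, relative to t)
deps = {
    "hidden": [("hidden", -1), ("input", 0)],  # hidden[t] reads hidden[t-1], input[t]
    "logits": [("hidden", 0)],                 # logits[t] reads hidden[t]
}

# Whole-program pass: how many steps into the future each tensor is
# still read, i.e., how many versions must stay live at once.
last_use = {}
for op, edges in deps.items():
    for src, offset in edges:
        future = -offset
        last_use[src] = max(last_use.get(src, future), future)
print(last_use)  # {'hidden': 1, 'input': 0} -> keep two versions of hidden
```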

GPU

  • GPU OS

    • LithOS: An Operating System for Efficient Machine Learning on GPUs [Paper] [arXiv]

      • CMU & Meta

      • A TPC Scheduler → Support spatial scheduling at the granularity of individual TPCs.

      • A kernel atomizer → Reduce head-of-line blocking and allow dynamic resource reallocation mid-execution.

      • A lightweight hardware right-sizing mechanism → Dynamically determine the minimal TPC resources needed per atom.

      • A power management mechanism → Reduce power consumption based upon in-flight work characteristics.

      • Built in Rust.

  • GPU Checkpointing

    • PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation [Paper] [arXiv]

      • SJTU IPADS

      • Proactively detect GPU memory reads and writes through a two-step process (sketched below):

        • Speculate about GPU memory accesses based on the arguments used when launching GPU kernels.

        • Validate these accesses efficiently at runtime using binary instrumentation.

      • Coordinated checkpoint data transfer and execution context pool.
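
A conceptual sketch of the speculate-then-validate loop (pure Python stand-ins, not PhoenixOS's CUDA-level mechanism): writes are predicted from kernel-launch arguments, then checked against the accesses that instrumentation actually observes.

```python
def speculate_writes(kernel_args: list[str]) -> set[str]:
    # Assumption for the sketch: pointer-typed arguments may be written.
    return {a for a in kernel_args if a.startswith("ptr_")}

def observed_writes() -> set[str]:
    # Stand-in for the binary-instrumentation report at runtime.
    return {"ptr_out"}

speculated = speculate_writes(["ptr_in", "ptr_out", "n=1024"])
actual = observed_writes()
if actual <= speculated:
    print("speculation valid: checkpoint can trust the speculated set")
else:
    print("mis-speculation on", actual - speculated, "-> fall back")
```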

  • GPU Storage

    • Managing Scalable Direct Storage Accesses for GPUs with GoFS [Paper]

      • UIUC

      • A GPU-orchestrated file system that offloads storage management to the GPU → Scale direct storage accesses for GPU programs.

RDMA

  • Live Migration

    • Device-Assisted Live Migration of RDMA Devices [Paper]

      • NVIDIA

      • A generic device-hypervisor interface.

      • The design and implementation of live migration support for the NVIDIA ConnectX family of network adapters.

      • Quiesce direct communication over the memory fabric (e.g., PCIe).

CXL

  • PCIe Pooling

    • Oasis: Pooling PCIe Devices Over CXL to Boost Utilization [Paper]

      • Columbia & Microsoft Azure

      • Provide a control plane and datapath over CXL pools → Map and route PCIe device traffic across host boundaries.

OS

  • Proto: A Guided Journey through Modern OS Construction [Paper]

    • UVA

  • How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service [Paper]

    • SJTU IPADS & Huawei

    • Copier, a new OS service for coordinated asynchronous copy that serves both user-mode applications and OS services.
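
A hypothetical sketch of a coordinated asynchronous copy interface (the names are illustrative, not Copier's actual API): callers submit copies and synchronize later, so a single service can batch and schedule them.

```python
import asyncio

class AsyncCopier:
    def __init__(self):
        self._tasks = []
    def submit(self, src: bytearray, dst: bytearray) -> asyncio.Task:
        async def copy():
            dst[:] = src                 # the actual (toy) copy
        task = asyncio.ensure_future(copy())
        self._tasks.append(task)
        return task
    async def drain(self):
        # Synchronization point: wait for all submitted copies.
        await asyncio.gather(*self._tasks)

async def main():
    copier = AsyncCopier()
    src, dst = bytearray(b"hello"), bytearray(5)
    copier.submit(src, dst)              # returns immediately
    await copier.drain()                 # synchronize when needed
    print(dst.decode())

asyncio.run(main())
```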

Resource Management

  • Serverless Computing

    • Unlocking True Elasticity for the Cloud-Native Era with Dandelion [Paper]

      • ETH

      • Dandelion, an elastic cloud platform with a declarative cloud-native programming model that replaces POSIX-based network interfaces with higher-level (e.g., HTTP-based) interfaces for applications to interact with remote services (e.g., cloud storage, databases, and AI inference services).

      • Execute applications expressed as DAGs of pure compute functions and communication functions.

    • Quilt: Resource-aware Merging of Serverless Workflows [Paper]

      • UPenn

      • Automatically merge workflows that consist of many functions (possibly in different languages) into one process → Avoid high invocation latency, communication overhead, and long chains of cold starts.
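
A minimal sketch of the merging idea (toy functions, not Quilt's compiler): a chain that would otherwise be three separate serverless invocations becomes direct calls inside one process, eliminating network hops and repeated cold starts.

```python
def fetch(user_id: int) -> dict:
    return {"user": user_id, "items": [1, 2, 3]}

def enrich(record: dict) -> dict:
    record["total"] = sum(record["items"])
    return record

def respond(record: dict) -> str:
    return f"user {record['user']}: total={record['total']}"

def merged_workflow(user_id: int) -> str:
    # One process, direct calls: no RPC between stages, one cold start.
    return respond(enrich(fetch(user_id)))

print(merged_workflow(42))
```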

  • Resource Allocation

    • COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization [Paper]

      • Microsoft & Meta & CMU

      • Reframe round-based resource allocation as a sequence of interconnected problems.

      • Provide a method for continual optimization of LP and MILP formulations of resource allocation problems.
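
A minimal sketch of round-based allocation as a sequence of closely related LPs, using scipy.optimize.linprog with assumed demands; it shows the problem structure that continual optimization exploits, while a real continual optimizer would warm-start from the previous round's solution rather than re-solving from scratch.

```python
from scipy.optimize import linprog

CAPACITY = 100.0

def solve_round(demands: list[float]):
    # Maximize served demand: min -sum(x) s.t. sum(x) <= CAPACITY,
    # 0 <= x_i <= demand_i.
    n = len(demands)
    res = linprog(c=[-1.0] * n,
                  A_ub=[[1.0] * n], b_ub=[CAPACITY],
                  bounds=[(0.0, d) for d in demands])
    return res.x

prev = solve_round([40, 30, 50])   # round t
# Round t+1 differs only slightly, so its solution stays close to prev;
# this interconnection across rounds is what continual optimization exploits.
curr = solve_round([42, 28, 55])
print(prev, curr)
```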

  • Cloud Deployment

    • Moirai: Optimizing Placement of Data and Compute in Hybrid Clouds [Paper]

      • CMU & Uber

Video

  • SAND: A New Programming Abstraction for Video-based Deep Learning [Paper]

    • KAIST

    • Integrate system-level optimizations to simplify the preprocessing pipeline and maximize resource efficiency.

Acronyms

  • OS: Operating System

  • LLM: Large Language Model

  • KV: Key-Value

  • JCT: Job Completion Time

  • MoE: Mixture-of-Experts

  • RAG: Retrieval Augmented Generation

  • RDMA: Remote Direct Memory Access

  • CXL: Compute Express Link

  • LP: Linear Program

  • MILP: Mixed Integer Linear Program
