SOSP 2025
Meta Info
Homepage: https://sigops.org/s/conferences/sosp/2025/
Paper List
Acceptance Rate
17.7% (= 65 / 368)
Papers
LLM
LLM Training
Robust LLM Training Infrastructure at ByteDance [Paper]
HKU & ByteDance Seed
ByteRobust, a robust LLM training infrastructure deployed at ByteDance.
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training [Paper]
CUHK & ByteDance & ByteDance Seed
A lightweight distributed tracing and root cause analysis system.
Trace collective communication states and leverage internal control and data dependencies.
DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism [Paper]
HKU & AWS
Introduce fine-grained blockwise partitioning of both data and computation.
TrainVerify: Equivalence-Based Verification for Distributed LLM Training [Paper]
UMich & MSRA
Formally verify that a distributed parallel execution plan is mathematically equivalent to the logical specification.
Introduce a stage-wise parallel verification algorithm and shape-reduction techniques → Reduce complexity while preserving formal correctness.
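As a rough illustration of equivalence-based verification (not TrainVerify's actual verifier), the sketch below checks that a hypothetical two-way column-parallel matmul plan produces the same result as the logical single-device specification, using deliberately tiny shapes in the spirit of shape reduction.
```python
# Minimal sketch: compare a hypothetical tensor-parallel execution plan
# against the logical specification on reduced shapes (the function names
# and the tp=2 sharding are illustrative assumptions, not TrainVerify's API).
import numpy as np

def logical_spec(x, w):
    # Logical specification: a plain matmul on the unsharded weight.
    return x @ w

def parallel_plan(x, w, tp=2):
    # Column-shard the weight across `tp` ranks, compute partial outputs
    # independently, then concatenate along the output dimension.
    shards = np.split(w, tp, axis=1)
    return np.concatenate([x @ s for s in shards], axis=1)

rng = np.random.default_rng(0)
x, w = rng.standard_normal((4, 8)), rng.standard_normal((8, 6))
assert np.allclose(logical_spec(x, w), parallel_plan(x, w))
print("parallel plan matches the logical spec on the reduced shapes")
```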
LLM Inference
Jenga: Effective Memory Management for Serving LLM with Heterogeneity [Paper] [arXiv]
THU & Chicago & UC Berkeley
Two challenges
Recent models have heterogeneous embeddings with different sizes.
Some new architectures use only a subset of the prefix tokens to generate the next token.
Designs
Two-level memory allocator: choose the page size as the least common multiple of the token embedding sizes.
Enable attention variants to customize this mechanism by specifying the exact prefix subset they use.
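A toy sketch of the page-size idea, assuming hypothetical per-token embedding sizes in bytes (not Jenga's allocator): picking the least common multiple means every embedding type tiles a page exactly, so pages can be shared across heterogeneous layers without internal waste.
```python
# Illustrative only: choose a page size that every embedding size divides evenly.
from math import lcm

def choose_page_size(embedding_sizes_bytes):
    return lcm(*embedding_sizes_bytes)

# Hypothetical heterogeneous model: full-attention KV entries plus smaller
# sliding-window / state-space entries (sizes are made up for illustration).
sizes = [4096, 1024, 512]
page = choose_page_size(sizes)
print(page, [page // s for s in sizes])  # 4096-byte pages hold 1, 4, or 8 entries
```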
PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications [Paper] [arXiv]
Chicago & THU & LinkedIn & UC Berkeley
Hybrid prefilling: Prefill non-attention layers chunk-by-chunk, but prefill the attention layers normally.
Suffix KV cache discarding / offloading: Discard the useless KV cache.
Continuous JCT calibration: Continuously re-estimate the JCT (job completion time) of each request based on which requests have already been scheduled, and then admit only the request with the lowest estimated JCT.
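The loop below is a minimal sketch of the calibration idea (the linear cost model is an assumption, not PrefillOnly's implementation): after each decision, the JCT of every pending request is re-estimated against the work already queued, and only the request with the lowest estimate is admitted.
```python
# Toy shortest-estimated-JCT scheduler; the cost model is an assumption.
def estimate_jct(prompt_len, queued_work):
    return queued_work + 1e-3 * prompt_len   # waiting time + estimated prefill time

def schedule(pending_prompt_lens):
    queued_work, order = 0.0, []
    while pending_prompt_lens:
        # Re-estimate every pending request's JCT against the current queue state.
        _, idx = min((estimate_jct(p, queued_work), i)
                     for i, p in enumerate(pending_prompt_lens))
        chosen = pending_prompt_lens.pop(idx)
        queued_work += 1e-3 * chosen          # the queue now includes this job
        order.append(chosen)
    return order

print(schedule([32000, 2000, 8000]))          # -> [2000, 8000, 32000]
```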
IC-Cache: Efficient Large Language Model Serving via In-context Caching [Paper]
UIUC & Google
Leverage historical request-response pairs from larger models as in-context examples.
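A minimal sketch of the caching idea (the retrieval heuristic and prompt format are assumptions, not IC-Cache's design): look up historical request-response pairs similar to the new request and prepend them as in-context examples for a smaller model.
```python
def similarity(a, b):
    # Crude token-overlap similarity; any retriever could be substituted here.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def build_prompt(request, history, k=2):
    # Pick the k most similar historical (request, response) pairs as examples.
    examples = sorted(history, key=lambda qr: similarity(request, qr[0]),
                      reverse=True)[:k]
    shots = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in examples)
    return f"{shots}\n\nQ: {request}\nA:"

history = [("What is CXL?", "A cache-coherent interconnect standard."),
           ("What is RDMA?", "Direct memory access over the network.")]
print(build_prompt("Explain what CXL is used for.", history))
```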
Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market [Paper]
PKU & Alibaba Cloud
Schedule multi-model requests and make auto-scaling decisions on a per-token basis to maximize service quality.
Reduce auto-scaling overhead through component reuse, explicit memory management, and fine-grained KV cache synchronization.
Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference [Paper]
SJTU IPADS & THU & SenseTime
LLM Applications
Pie: A Programmable Serving System for Emerging LLM Applications [Paper]
Yale
Decompose the traditional generation loop into fine-grained service handlers exposed via an API, and delegate control of the generation process to user-provided programs called inferlets.
Enable applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without modifying the serving system.
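The sketch below is a hypothetical, simplified rendering of the inferlet idea; the handler names are invented and are not Pie's API. The point is that the generation loop and the KV cache policy live in user code, while the engine only exposes fine-grained handlers.
```python
class Handlers:
    """Stand-ins for engine-provided, fine-grained service handlers."""
    def prefill(self, prompt_tokens):     # allocate KV state and run prefill
        return list(prompt_tokens)        # toy "KV cache": just the token list
    def decode_step(self, kv):            # produce one next token
        return f"tok{len(kv)}"
    def append_kv(self, kv, token):       # cache policy is under user control
        kv.append(token)

def inferlet(h, prompt_tokens, max_new_tokens=3):
    kv = h.prefill(prompt_tokens)
    out = []
    for _ in range(max_new_tokens):       # the generation loop is user code
        tok = h.decode_step(kv)
        h.append_kv(kv, tok)              # a custom KV strategy could go here
        out.append(tok)
    return out

print(inferlet(Handlers(), ["hello", "world"]))  # -> ['tok2', 'tok3', 'tok4']
```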
RAG Systems
KV Cache Management
DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction [Paper]
Huawei & CUHK & SJTU
Exploit three levels of differentiation in the KV cache:
The differing impact of keys and values on attention computation.
The varying importance of tokens.
The diverse dynamic sparsity patterns across attention heads.
An on-GPU memory manager → Compact the fragmented free memory list into contiguous regions in parallel.
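An illustrative sketch of differentiated compression (the bit widths, importance signal, and keep ratio are assumptions, not DiffKV's algorithm): quantize keys more conservatively than values, and keep the most important tokens at full precision.
```python
import numpy as np

def fake_quant(x, bits):
    # Uniform symmetric fake quantization to `bits` bits (an assumption).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def compress_kv(keys, values, importance, keep_ratio=0.5):
    k_top = int(len(importance) * keep_ratio)
    keep = np.argsort(importance)[-k_top:]                        # important tokens
    comp_k, comp_v = fake_quant(keys, 8), fake_quant(values, 4)   # keys kept wider
    comp_k[keep], comp_v[keep] = keys[keep], values[keep]         # full precision
    return comp_k, comp_v

rng = np.random.default_rng(0)
keys, values = rng.standard_normal((16, 64)), rng.standard_normal((16, 64))
importance = rng.random(16)        # e.g. accumulated attention mass per token
ck, cv = compress_kv(keys, values, importance)
```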
Multi-GPU Operator Optimization
Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling [Paper] [Artifact]
UCSD & Meta
A multi-GPU operator compiler based on a loop-based intermediate representation, CommIR.
Treat remote GPU memory as an explicitly managed extension of the memory hierarchy.
Automatically reproduce the performance of hand-optimized baselines like RingAttention and Ulysses.
MoE
MoE Inference
KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models [Paper] [Code]
THU & Approaching.AI
Employ optimized, AMX-specialized kernels to fully utilize the computational capabilities of modern CPUs and incorporate an asynchronous CPU-GPU task scheduling mechanism to minimize overhead.
Expert Deferral → Overlap CPU and GPU computations.
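A toy sketch of the overlap pattern (the numpy stand-ins are assumptions, not KTransformers' kernels): the CPU-side expert computation is submitted asynchronously so GPU-side work proceeds without waiting, and the deferred expert output is merged back afterwards.
```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def cpu_experts(x):          # stand-in for the AMX-optimized CPU expert GEMMs
    return x @ np.eye(x.shape[-1])

def gpu_path(x):             # stand-in for attention + GPU-resident experts
    return x * 2.0

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
with ThreadPoolExecutor(max_workers=1) as pool:
    deferred = pool.submit(cpu_experts, x)   # CPU experts run in the background
    y_gpu = gpu_path(x)                      # GPU work overlaps with the CPU work
    y = y_gpu + deferred.result()            # merge the deferred expert output
print(y.shape)
```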
Distributed Training
Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters [Paper]
ETH & MIT & HES-SO
Combine an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework → Support different types of heterogeneity to optimize training throughput and cost.
Deep Learning Compilation
Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs [Paper]
ICL
Temporal relationships: a tensor at one timestep may depend on tensors from earlier or later timesteps.
Construct a symbolic dependence graph → Concisely encode dynamic dependencies between operators, and apply whole-program optimizations.
GPU
GPU OS
LithOS: An Operating System for Efficient Machine Learning on GPUs [Paper] [arXiv]
CMU & Meta
A TPC Scheduler → Support spatial scheduling at the granularity of individual TPCs.
A kernel atomizer → Reduce head-of-line blocking and allow dynamic resource reallocation mid-execution.
A lightweight hardware right-sizing mechanism → Dynamically determine the minimal TPC resources needed per atom.
A power management mechanism → Reduce power consumption based upon in-flight work characteristics.
Built in Rust.
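A toy sketch of spatial scheduling at TPC granularity combined with right-sizing (the pool size and demands are made up; this is not LithOS, which is built in Rust): each kernel atom is granted only the minimal number of TPCs it needs from a shared pool.
```python
TOTAL_TPCS = 24                          # hypothetical GPU with 24 TPCs

def assign(free_tpcs, atoms):
    placements = {}
    for name, need in atoms:             # `need` = right-sized TPC demand per atom
        if need <= len(free_tpcs):
            placements[name] = [free_tpcs.pop() for _ in range(need)]
    return placements

pool = list(range(TOTAL_TPCS))
print(assign(pool, [("llm_decode_atom", 6), ("vision_atom", 4)]))
```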
GPU Checkpointing
PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation [Paper] [arXiv]
SJTU IPADS
Proactively detect GPU memory reads and writes through a two-step process:
Speculate about GPU memory accesses based on the arguments used when launching GPU kernels.
Validate these accesses efficiently at runtime using binary instrumentation.
Coordinated checkpoint data transfer and execution context pool.
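A conceptual sketch of speculate-then-validate (the buffer naming and the observed-access set are assumptions, not PhoenixOS's mechanism): predict which buffers a kernel may touch from its launch arguments, then check the prediction against the accesses actually observed.
```python
def speculate(kernel_args):
    # Assume every buffer-handle argument may be read or written by the kernel.
    return {a for a in kernel_args if isinstance(a, str) and a.startswith("buf_")}

def validate(speculated, observed):
    # The speculation is safe if it covers every buffer actually touched;
    # otherwise the checkpointer would fall back to a conservative path.
    return observed <= speculated

speculated = speculate(["buf_q", "buf_kv", 128, 0.5])
print(speculated, validate(speculated, observed={"buf_q", "buf_kv"}))
```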
GPU Storage
Managing Scalable Direct Storage Accesses for GPUs with GoFS [Paper]
UIUC
A GPU-orchestrated file system that offloads storage management to the GPU → Scale direct storage accesses for GPU programs.
RDMA
Live Migration
Device-Assisted Live Migration of RDMA Devices [Paper]
NVIDIA
A generic device-hypervisor interface.
The design and implementation of live migration support for the NVIDIA ConnectX family of network adapters.
Quiesce direct communication over the memory fabric (e.g., PCIe).
CXL
PCIe Pooling
Oasis: Pooling PCIe Devices Over CXL to Boost Utilization [Paper]
Columbia & Microsoft Azure
Provide a control plane and datapath over CXL pools → Map and route PCIe device traffic across host boundaries.
OS
Proto: A Guided Journey through Modern OS Construction [Paper]
UVA
How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service [Paper]
SJTU IPADS & Huawei
Copier, a new OS service for coordinated asynchronous copy that serves both user-mode applications and OS services.
Resource Management
Serverless Computing
Unlocking True Elasticity for the Cloud-Native Era with Dandelion [Paper]
ETH
Dandelion, an elastic cloud platform with a declarative cloud-native programming model that replaces POSIX-based network interfaces with higher-level (e.g., HTTP-based) interfaces for applications to interact with remote services (e.g., cloud storage, databases, and AI inference services).
Execute applications expressed as DAGs of pure compute functions and communication functions.
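A minimal sketch of the programming model's flavor (the node names, URL, and executor are illustrative assumptions, not Dandelion's API): the application is a DAG whose nodes are either pure compute functions or communication functions, evaluated in dependency order.
```python
def fetch(url):                 # communication function (assumed HTTP-style fetch)
    return f"payload from {url}"

def transform(payload):         # pure compute function
    return payload.upper()

dag = {
    "fetch":     (fetch, ["https://storage.example/object"]),  # hypothetical URL
    "transform": (transform, ["fetch"]),
}

def run(dag):
    results = {}
    for name, (fn, deps) in dag.items():   # assumes nodes are listed in topo order
        args = [results.get(d, d) for d in deps]
        results[name] = fn(*args)
    return results

print(run(dag)["transform"])
```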
Quilt: Resource-aware Merging of Serverless Workflows [Paper]
UPenn
Automatically merge workflows that consist of many functions (possibly in different languages) into one process → Avoid high invocation latency, communication overhead, and long chains of cold starts.
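A toy sketch of the merging idea (the function bodies are placeholders, not Quilt's output): a two-stage workflow that would normally be two separate serverless invocations is fused into a single in-process call, so data flows through memory and at most one cold start occurs.
```python
def resize(image):                 # stage 1 (could be a separate function/language)
    return f"resized({image})"

def classify(image):               # stage 2
    return f"label({image})"

def merged_workflow(image):
    # Stages now exchange data through memory instead of separate invocations.
    return classify(resize(image))

print(merged_workflow("cat.jpg"))
```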
Resource Allocation
COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization [Paper]
Microsoft & Meta & CMU
Reframe round-based resource allocation as a sequence of interconnected problems.
Provide a method for continual optimization of LP and MILP formulations of resource allocation problems.
Cloud Deployment
Moirai: Optimizing Placement of Data and Compute in Hybrid Clouds [Paper]
CMU & Uber
Video
SAND: A New Programming Abstraction for Video-based Deep Learning [Paper]
KAIST
Integrate system-level optimizations to simplify the preprocessing pipeline and maximize resource efficiency.
Acronyms
OS: Operating System
LLM: Large Language Model
MoE: Mixture-of-Experts
RAG: Retrieval Augmented Generation
RDMA: Remote Direct Memory Access
CXL: Compute Express Link
LP: Linear Program
MILP: Mixed Integer Linear Program