# SOSP 2025

## Meta Info

Homepage: <https://sigops.org/s/conferences/sosp/2025/>

### Paper List

* <https://sigops.org/s/conferences/sosp/2025/accepted.html>
* <https://dl.acm.org/doi/proceedings/10.1145/3731569>

### Acceptance Rate

17.7% (= 65 / 368)

## Papers

### LLM

* LLM Training
  * Robust LLM Training Infrastructure at ByteDance \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764838)]
    * HKU & ByteDance Seed
    * **ByteRobust**
  * Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764848)]
    * CUHK & ByteDance & ByteDance Seed
    * A lightweight distributed tracing and root cause analysis system.
    * Trace collective communication states and leverage internal control and data dependencies for root cause analysis.
  * DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764849)]
    * HKU & AWS
    * Introduce fine-grained blockwise partitioning of both data and computation.
  * TrainVerify: Equivalence-Based Verification for Distributed LLM Training \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764850)]
    * UMich & MSRA
    * Formally verify that a distributed parallel execution plan is mathematically equivalent to the logical specification.
    * Introduce a stage-wise parallel verification algorithm and shape-reduction techniques → Reduce complexity while preserving formal correctness.
* LLM Inference
  * Jenga: Effective Memory Management for Serving LLM with Heterogeneity \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764823)] \[[arXiv](https://arxiv.org/abs/2503.18292)]
    * THU & Chicago & UC Berkeley
    * Two challenges
      * Recent models have heterogeneous embeddings with different sizes.
      * Some new architectures use only a subset of the prefix tokens to generate the next token.
    * Designs
      * Two-level memory allocator: choose the page size as the least common multiple (LCM) of the token embedding sizes.
      * Enable attention variants to customize this mechanism by precisely specifying the exact prefix subset.
  * PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764834)] \[[arXiv](https://arxiv.org/abs/2505.07203)]
    * Chicago & THU & LinkedIn & UC Berkeley
    * Hybrid prefilling: Prefill non-attention layers chunk-by-chunk, but prefill the attention layers normally.
    * Suffix KV cache discarding / offloading: Discard the useless KV cache.
    * Continuous JCT calibration: Continuously re-estimate the JCT of each request based on which requests were previously scheduled, and then schedule only the request with the lowest JCT.
  * IC-Cache: Efficient Large Language Model Serving via In-context Caching \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764829)]
    * UIUC & Google
    * Leverage historical request-response pairs from larger models as in-context examples.
  * Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764815)]
    * PKU & Alibaba Cloud
    * Schedule multi-model requests and make auto-scaling decisions on a per-token basis to maximize service quality.
    * Reduce auto-scaling overhead through component reuse, explicit memory management, and fine-grained KV cache synchronization.
  * Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764808)]
    * SJTU IPADS & THU & SenseTime
* LLM Applications
  * Pie: A Programmable Serving System for Emerging LLM Applications \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764814)]
    * Yale
    * Decompose the traditional generation loop into fine-grained service handlers exposed via an API, and delegate control of the generation process to user-provided programs, called *inferlets*.
    * Enable applications to implement new KV cache strategies, bespoke generation logic, and seamlessly integrate computation and I/O—entirely within the application, without requiring modifications to the serving system.
* RAG Systems
  * METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764855)] \[[arXiv](https://arxiv.org/abs/2412.10543)]
    * Chicago & Princeton & MSR
    * Jointly schedule queries and adapt the key RAG configurations of each query (e.g., the number of retrieved text chunks, synthesis methods).
  * HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764806)] \[[arXiv](https://arxiv.org/abs/2507.09138)] \[[Artifact](https://github.com/Leo9660/HedraRAG_AE)]
    * UCSD
    * RAGraph, a graph-based abstraction → Expose optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness.
* KV Cache Management
  * DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764810)]
    * Huawei & CUHK & SJTU
    * Exploit three levels of differentiation in the KV cache:
      * The differing impact of keys and values on attention computation.
      * The varying importance of tokens.
      * The diverse dynamic sparsity patterns across attention heads.
    * An on-GPU memory manager → Compact the fragmented free memory list into contiguous regions in parallel.
* Multi-GPU Operator Optimization
  * Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764798)] \[[Artifact](https://github.com/ChandlerGuan/mercury_artifact)]
    * UCSD & Meta
    * A multi-GPU operator compiler based on a loop-based intermediate representation, CommIR.
    * Treat remote GPU memory as an explicitly managed extension of the memory hierarchy.
    * Automatically reproduce the performance of hand-optimized baselines like RingAttention and Ulysses.
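
Jenga's page-sizing trick is simple to state concretely: a page sized to the least common multiple of all embedding sizes can be carved into whole slots for any embedding type, with no internal waste. A minimal sketch (the sizes are made up; this is not the Jenga implementation):

```python
from math import lcm

def page_size_for(embedding_sizes):
    """Smallest page size (in elements) that every embedding size divides,
    so a page can be split into same-type slots with no leftover space."""
    return lcm(*embedding_sizes)

def slots_per_page(page, emb):
    # How many embeddings of one type fit in a page.
    return page // emb

# Hypothetical model with heterogeneous per-layer embedding sizes.
sizes = [4096, 1024, 512]
page = page_size_for(sizes)
print(page, [slots_per_page(page, s) for s in sizes])  # 4096 [1, 4, 8]
```

The second allocator level then hands out slots within a page per embedding type, which is what makes heterogeneous KV layouts share one physical pool.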

### MoE

* MoE Inference
  * KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764843)] \[[Code](https://github.com/kvcache-ai/ktransformers)]
    * THU & Approaching.AI
    * Employ optimized, AMX-specialized kernels to fully utilize the computational capabilities of modern CPUs and incorporate an asynchronous CPU-GPU task scheduling mechanism to minimize overhead.
    * Expert Deferral → Overlap CPU and GPU computations.
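
KTransformers' overlap idea can be pictured with ordinary threads: launch the CPU-side expert work asynchronously, run the GPU-side work in the meantime, then join. A toy illustration only; the real system's AMX kernels and scheduler are far more involved, and both "workloads" here are stand-in arithmetic:

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_attention(x):
    # Stand-in for GPU-side attention/dense computation.
    return x + 1

def cpu_experts(x):
    # Stand-in for CPU-side (e.g., AMX-accelerated) expert FFNs.
    return x * 2

def layer(x):
    # Submit the CPU expert work first so it overlaps with GPU work,
    # then combine both partial results.
    with ThreadPoolExecutor(max_workers=1) as pool:
        cpu_future = pool.submit(cpu_experts, x)
        gpu_out = gpu_attention(x)
        return gpu_out + cpu_future.result()

print(layer(3))  # (3 + 1) + (3 * 2) = 10
```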

### Distributed Training

* Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764839)]
  * ETH & MIT & HES-SO
  * Combine an efficient search space exploration algorithm, accurate runtime and memory footprint simulation, and a distributed training framework → Support different types of heterogeneity to optimize training throughput and cost.
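
Sailor's core loop is a search over parallelization configurations scored by a runtime simulator. A deliberately tiny stand-in with a made-up cost model (every constant below is illustrative, not from the paper):

```python
from itertools import product

GPUS = 16
MODEL_LAYERS = 32
COMPUTE_PER_LAYER = 1.0                       # arbitrary time units
COMM_COST = {1: 0.0, 2: 0.2, 4: 0.5, 8: 1.1}  # tensor-parallel sync cost

def step_time(dp, pp, tp):
    # dp scales the global batch; it does not change per-step time here.
    layers_per_stage = MODEL_LAYERS / pp
    compute = layers_per_stage * COMPUTE_PER_LAYER / tp
    bubble = (pp - 1) * 0.1                   # pipeline bubble penalty
    return compute + bubble + COMM_COST[tp]

def search():
    best = None
    for dp, pp, tp in product([1, 2, 4, 8], repeat=3):
        if dp * pp * tp != GPUS:
            continue
        t = step_time(dp, pp, tp)
        if best is None or t < best[0]:
            best = (t, (dp, pp, tp))
    return best

print(search())
```

The real system replaces this brute-force loop with pruned search-space exploration and replaces the toy formula with accurate runtime and memory simulation over heterogeneous, geo-distributed resources.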

### Deep Learning Compilation

* Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764840)]
  * ICL
  * Temporal relationships: a tensor at one timestep may depend on tensors from earlier or later timesteps.
  * Construct a symbolic dependence graph → Concisely encode dynamic dependencies between operators, and apply whole-program optimizations.
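
The symbolic-dependence idea can be pictured with a toy interpreter in which each operator declares its inputs as timestep offsets (e.g., `("h", -1)` means "this op at the previous step") rather than as concrete graph edges. Purely illustrative, not Tempo's representation:

```python
# Run a graph of ops whose dependencies are symbolic timestep offsets.
def run(graph, inputs, steps):
    vals = {}
    for t in range(steps):
        for name, (fn, deps) in graph.items():
            args = []
            for dep, off in deps:
                s = t + off
                if dep == "in":
                    args.append(inputs[s] if 0 <= s < steps else 0)
                else:
                    args.append(vals.get((dep, s), 0))  # 0 outside the horizon
            vals[(name, t)] = fn(*args)
    return [vals[("h", t)] for t in range(steps)]  # collect op "h"

# h[t] = h[t-1] + in[t]: a running sum expressed via one symbolic offset.
graph = {"h": (lambda prev, x: prev + x, [("h", -1), ("in", 0)])}
print(run(graph, [1, 2, 3, 4], 4))  # [1, 3, 6, 10]
```

Because the offsets are symbolic, the same one-line graph describes the dependence pattern for every timestep, which is what makes whole-program analysis and optimization tractable.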

### GPU

* GPU OS
  * LithOS: An Operating System for Efficient Machine Learning on GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764818)] \[[arXiv](https://arxiv.org/abs/2504.15465)]
    * CMU & Meta
    * A TPC Scheduler → Support spatial scheduling at the granularity of individual TPCs.
    * A kernel atomizer → Reduce head-of-line blocking and allow dynamic resource reallocation mid-execution.
    * A lightweight hardware right-sizing mechanism → Dynamically determine the minimal TPC resources needed per atom.
    * A power management mechanism → Reduce power consumption based upon in-flight work characteristics.
    * Built in Rust.
* GPU Checkpointing
  * PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764813)] \[[arXiv](https://arxiv.org/abs/2405.12079)]
    * SJTU IPADS
    * Proactively detect GPU memory reads and writes through a two-step process:
      * Speculate about GPU memory accesses based on the arguments used when launching GPU kernels.
      * Validate these accesses efficiently at runtime using binary instrumentation.
    * Coordinated checkpoint data transfer and execution context pool.
* GPU Storage
  * Managing Scalable Direct Storage Accesses for GPUs with GoFS \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764857)]
    * UIUC
    * GPU-orchestrated file system to offload the storage management to the GPU → Scale the direct storage accesses for GPU programs.
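
PhoenixOS's speculate-then-validate step can be caricatured in a few lines: guess the touched buffers from a kernel's launch arguments, check the guess against the instrumented access set, and fall back to a conservative path when speculation was wrong. The pointer heuristic and all names below are invented:

```python
def speculate(launch_args):
    # Assume any "buf*"-named argument is a pointer the kernel may access.
    return {a for a in launch_args if isinstance(a, str) and a.startswith("buf")}

def checkpoint_kernel(launch_args, actual_accesses):
    guess = speculate(launch_args)
    if actual_accesses <= guess:       # speculation covered every access
        return ("speculated", guess)
    # Mis-speculation: copy conservatively over the union of both sets.
    return ("fallback", guess | actual_accesses)

print(checkpoint_kernel(["buf_a", "buf_b", 42], {"buf_a"}))
print(checkpoint_kernel(["buf_a"], {"buf_a", "buf_hidden"}))
```

The point of the split is that the cheap speculation usually suffices, so the checkpointer can copy memory concurrently with execution and only pays for validation, not for tracking every access up front.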

### RDMA

* Live Migration
  * Device-Assisted Live Migration of RDMA Devices \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764795)]
    * NVIDIA
    * A generic device-hypervisor interface.
    * The design and implementation of live migration support for the NVIDIA ConnectX family of network adapters.
    * Quiesce direct communication over the memory fabric (e.g., PCIe).
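
The quiesce → snapshot → restore sequence behind device-assisted migration, reduced to a toy state machine. The fields and states are invented for illustration and are not the ConnectX interface described in the paper:

```python
class Device:
    def __init__(self, posted_wqes=0):
        self.state = "running"
        self.counters = {"posted_wqes": posted_wqes}

    def quiesce(self):
        # Stop the device from initiating new DMA/PCIe traffic so its
        # state stops changing and can be captured consistently.
        self.state = "quiesced"

    def snapshot(self):
        assert self.state == "quiesced"
        return dict(self.counters)

def migrate(src, dst):
    src.quiesce()
    dst.counters = src.snapshot()   # transfer captured device state
    dst.state = "running"
    return dst

d = migrate(Device(posted_wqes=3), Device())
print(d.state, d.counters)
```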

### CXL

* PCIe Pooling
  * Oasis: Pooling PCIe Devices Over CXL to Boost Utilization \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764812)]
    * Columbia & Microsoft Azure
    * Provide a control plane and datapath over CXL pools → Map and route PCIe device traffic across host boundaries.
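
The pooling idea can be reduced to a toy control plane: devices sit in a shared pool (reachable over CXL) and are attached to whichever host asks, rather than being stranded at one host. Illustrative only; no relation to Oasis's actual control plane:

```python
class DevicePool:
    def __init__(self, devices):
        self.free = list(devices)
        self.assigned = {}          # device -> host

    def attach(self, host):
        if not self.free:
            return None             # pool exhausted
        dev = self.free.pop()
        self.assigned[dev] = host
        return dev

    def detach(self, dev):
        if self.assigned.pop(dev, None) is not None:
            self.free.append(dev)   # device becomes available pool-wide

pool = DevicePool(["gpu0", "gpu1"])
print(pool.attach("hostA"), pool.attach("hostB"), pool.attach("hostC"))
pool.detach("gpu1")
print(pool.attach("hostC"))
```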

### OS

* Proto: A Guided Journey through Modern OS Construction \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764811)]
  * UVA
* How to Copy Memory? Coordinated Asynchronous Copy as a First-Class OS Service \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764800)]
  * SJTU IPADS & Huawei
  * Copier, a new OS service of *coordinated asynchronous copy*, to serve both user-mode applications and OS services.
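
A coordinated asynchronous copy service in miniature: the caller submits a copy, receives a completion handle, and blocks only when it actually needs the data. A toy sketch with threads, not the Copier API:

```python
import threading

class AsyncCopier:
    def submit(self, src, dst, n):
        done = threading.Event()
        def work():
            dst[:n] = src[:n]       # the actual data movement
            done.set()
        threading.Thread(target=work).start()
        return done                  # completion handle for the caller

src = bytearray(b"hello world")
dst = bytearray(len(src))
handle = AsyncCopier().submit(src, dst, len(src))
handle.wait()                        # block only when the data is needed
print(bytes(dst))                    # b'hello world'
```

In a real OS service the "coordination" part matters: copies from many clients share workers and bandwidth, and the kernel can reorder or batch them, which a per-caller thread cannot do.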

### Resource Management

* Serverless Computing
  * Unlocking True Elasticity for the Cloud-Native Era with Dandelion \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764803)]
    * ETH
    * Dandelion, an elastic cloud platform with *a declarative cloud-native programming model* that replaces POSIX-based network interfaces with higher-level (e.g., HTTP-based) interfaces for applications to interact with remote services (e.g., cloud storage, databases, and AI inference services).
    * Execute applications expressed as DAGs of pure compute functions and communication functions.
  * Quilt: Resource-aware Merging of Serverless Workflows \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764830)]
    * UPenn
    * Automatically merge workflows that consist of many functions (possibly in different languages) into one process → Avoid high invocation latency, communication overhead, and long chains of cold starts.
* Resource Allocation
  * COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764846)]
    * Microsoft & Meta & CMU
    * Reframe round-based resource allocation as a sequence of interconnected problems.
    * Provide a method for continual optimization of LP and MILP formulations of resource allocation problems.
* Cloud Deployment
  * Moirai: Optimizing Placement of Data and Compute in Hybrid Clouds \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764802)]
    * CMU & Uber
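
Quilt's merging benefit is easy to see with a toy cost model: a chain of N functions pays N invocation overheads when each runs as its own serverless invocation, but only one when fused into a single process. All numbers and function names are made up:

```python
INVOCATION_OVERHEAD_MS = 50  # made-up per-invocation cold-start/RPC cost

def resize(img):  return img + ":resized"
def filter_(img): return img + ":filtered"
def store(img):   return img + ":stored"

workflow = [resize, filter_, store]

def run_unmerged(x):
    cost = 0
    for fn in workflow:
        cost += INVOCATION_OVERHEAD_MS  # each step is its own invocation
        x = fn(x)
    return x, cost

def run_merged(x):
    cost = INVOCATION_OVERHEAD_MS       # one invocation for the fused chain
    for fn in workflow:
        x = fn(x)
    return x, cost

print(run_unmerged("img"))  # ('img:resized:filtered:stored', 150)
print(run_merged("img"))    # ('img:resized:filtered:stored', 50)
```

The hard part the paper addresses, and this sketch ignores, is doing the merge automatically across languages while respecting each function's resource requirements.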

### Video

* SAND: A New Programming Abstraction for Video-based Deep Learning \[[Paper](https://dl.acm.org/doi/10.1145/3731569.3764847)]
  * KAIST
  * Integrate system-level optimizations to simplify the preprocessing pipeline and maximize resource efficiency.

## Acronyms

* OS: Operating System
* LLM: Large Language Model
* JCT: Job Completion Time
* MoE: Mixture-of-Experts
* RAG: Retrieval Augmented Generation
* RDMA: Remote Direct Memory Access
* CXL: Compute Express Link
* LP: Linear Program
* MILP: Mixed Integer Linear Program

