# NSDI 2025

## Meta Info

Homepage: <https://www.usenix.org/conference/nsdi25>

Paper list: <https://www.usenix.org/conference/nsdi25/technical-sessions>

### Acceptance Rate

* Total: 12.5% (= 83 / 666)
* Fall: 13.7% (= 55 / 401)
* Spring: 10.6% (= 28 / 265)
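
As a quick sanity check on the figures above, the short sketch below recomputes each rate from the accepted and submitted counts. The helper name `acceptance_rate` and the `rounds` dictionary are illustrative assumptions, not part of any official USENIX tooling.

```python
# Counts copied from the list above: (accepted, submitted) per deadline.
rounds = {"Fall": (55, 401), "Spring": (28, 265)}


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage, rounded to one decimal."""
    return round(100 * accepted / submitted, 1)


# The overall figure is the sum of both deadlines.
total_accepted = sum(accepted for accepted, _ in rounds.values())
total_submitted = sum(submitted for _, submitted in rounds.values())

rates = {name: acceptance_rate(a, s) for name, (a, s) in rounds.items()}
rates["Total"] = acceptance_rate(total_accepted, total_submitted)

print(rates)
```

The per-deadline counts add up to the stated totals (55 + 28 = 83 accepted, 401 + 265 = 666 submitted), so the three percentages are mutually consistent.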

## Papers

### Large Language Models (LLMs)

* LLM Training
  * Minder: Faulty Machine Detection for Large-scale Distributed Model Training \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/deng)]
    * THU & ByteDance & NEU & Harvard
    * Detect faulty machines automatically and efficiently by spotting their distinctive monitoring-metric patterns.
  * Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/yao)]
    * FDU & Tencent & UChicago
  * Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/dong)]
    * Alibaba Cloud
  * Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/gui)]
    * THU & Zhongguancun Lab & UPenn
  * SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-simai)]
    * Alibaba Cloud
* Reinforcement Learning from Human Feedback (RLHF)
  * Optimizing RLHF Training for Large Language Models with Stage Fusion \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhong)] \[[arXiv](https://arxiv.org/abs/2409.13221)]
    * PKU & StepFun
* Checkpointing
  * BCP: A Unified Checkpointing System for Large Foundation Model Development \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/wan-borui)]
    * HKU & ByteDance

### Deep Learning Recommendation Models (DLRMs)

* GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/yang)]
  * HKUST & Alibaba

### Model Serving

* SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/khare)]
  * GaTech & UC Berkeley & Adobe

### Collective Communication

* AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin)]
  * USTC & Microsoft
* OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/warraich)]
  * Purdue & NVIDIA & VMware Research & Feldera
* Efficient Direct-Connect Topologies for Collective Communications \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhao-liangyu)]
  * UW & Raytheon BBN Technologies & MIT

### Networking

* Remote Direct Memory Access (RDMA)
  * White-Boxing RDMA with Packet-Granular Software Control \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhao-chenxingyu)]
    * UW & UW-Madison
  * Mitigating Scalability Walls of RDMA-based Container Networks \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/liu-wei)]
    * Alibaba Cloud
* Application Networks
  * High-level Programming for Application Networks \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhu)]
    * UW & Duke
* Container Overlay Network
  * ONCache: A Cache-Based Low-Overhead Container Overlay Network \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/lin-shengkai)]
    * SJTU & Broadcom
* Placement
  * Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/bazzaz)]
    * Google & USC & Harvard & UCLA & Columbia
* Network Mitigation
  * Enhancing Network Failure Mitigation with Performance-Aware Ranking \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/namyar)]
    * USC & Microsoft

### Resource Management

* Granular Management
  * Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/ruan)]
    * MIT & Brown & USC & VMware Research
    * Provide developers with familiar, high-level abstractions (e.g., data structures, batch computing); decompose them into resource proclets, granular units that each primarily consume resources of one type; split, merge, and migrate resource proclets in milliseconds.
  * GRANNY: Granular Management of Compute-Intensive Applications in the Cloud \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/segarra)]
    * ICL
* Resource Scheduling
  * GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/xu-kaiqiang)]
    * HKUST
* Serverless Computing
  * Making Serverless Pay-For-Use a Reality with Leopard \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/cao)]
    * UW-Madison
* Userspace Scheduling
  * The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/guo)]
    * UCSD

### Fault Tolerance

* One-Size-Fits-None: Understanding and Enhancing Slow Fault Tolerance in Modern Distributed Systems \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/lu)]
  * UMich & SJTU

### Memory Disaggregation

* Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/li-quanxi)]
  * UCAS & PKU & Huawei Cloud & SJTU
* Eden: Developer-Friendly Application-Integrated Far Memory \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/yelam)]
  * UCSD & Technion & VMware Research

### Real-Time Video Streaming

* Mowgli: A Passive Approach to Learning Real-Time Rate Control for Video Conferencing \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/agarwal)]
  * Princeton

