# NSDI 2025

## Meta Info

Homepage: <https://www.usenix.org/conference/nsdi25>

Paper list: <https://www.usenix.org/conference/nsdi25/technical-sessions>

### Acceptance Rate

* Total: 12.5% (= 83 / 666)
* Fall: 13.7% (= 55 / 401)
* Spring: 10.6% (= 28 / 265)
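
The percentages above follow directly from the accepted and submitted counts. A minimal sketch to re-derive them, assuming the counts listed here are correct (the helper name `acceptance_rate` is illustrative, not from any official source):

```python
# Accepted and submitted counts as listed above.
counts = {
    "Total": (83, 666),
    "Fall": (55, 401),
    "Spring": (28, 265),
}


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage, rounded to one decimal."""
    return round(100 * accepted / submitted, 1)


rates = {name: acceptance_rate(a, s) for name, (a, s) in counts.items()}

# Sanity check: the per-deadline counts should add up to the totals.
fall, spring, total = counts["Fall"], counts["Spring"], counts["Total"]
sums_match = (fall[0] + spring[0], fall[1] + spring[1]) == total

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```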

## Papers

### Large Language Models (LLMs)

* LLM Training
  * Minder: Faulty Machine Detection for Large-scale Distributed Model Training \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/deng)]
    * THU & ByteDance & NEU & Harvard
    * Automatically and efficiently detect faulty machines from their distinctive monitoring metric patterns.
  * Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/yao)]
    * FDU & Tencent & UChicago
  * Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/dong)]
    * Alibaba Cloud
  * Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/gui)]
    * THU & Zhongguancun Lab & UPenn
  * SimAI: Unifying Architecture Design and Performance Tuning for Large-Scale Large Language Model Training with Scalability and Precision \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/wang-xizheng-simai)]
    * Alibaba Cloud
* Reinforcement Learning from Human Feedback (RLHF)
  * Optimizing RLHF Training for Large Language Models with Stage Fusion \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhong)] \[[arXiv](https://arxiv.org/abs/2409.13221)]
    * PKU & StepFun
* Checkpointing
  * BCP: A Unified Checkpointing System for Large Foundation Model Development \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/wan-borui)]
    * HKU & ByteDance

### Deep Learning Recommendation Models (DLRMs)

* GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/yang)]
  * HKUST & Alibaba

### Model Serving

* SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/khare)]
  * GaTech & UC Berkeley & Adobe

### Collective Communication

* AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/xu-guanbin)]
  * USTC & Microsoft
* OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/warraich)]
  * Purdue & NVIDIA & VMware Research & Feldera
* Efficient Direct-Connect Topologies for Collective Communications \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhao-liangyu)]
  * UW & Raytheon BBN Technologies & MIT

### Networking

* Remote Direct Memory Access (RDMA)
  * White-Boxing RDMA with Packet-Granular Software Control \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhao-chenxingyu)]
    * UW & UW-Madison
  * Mitigating Scalability Walls of RDMA-based Container Networks \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/liu-wei)]
    * Alibaba Cloud
* Application Networks
  * High-level Programming for Application Networks \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/zhu)]
    * UW & Duke
* Container Overlay Network
  * ONCache: A Cache-Based Low-Overhead Container Overlay Network \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/lin-shengkai)]
    * SJTU & Broadcom
* Placement
  * Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/bazzaz)]
    * Google & USC & Harvard & UCLA & Columbia
* Network Failure Mitigation
  * Enhancing Network Failure Mitigation with Performance-Aware Ranking \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/namyar)]
    * USC & Microsoft

### Resource Management

* Granular Management
  * Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/ruan)]
    * MIT & Brown & USC & VMware Research
    * Provide developers with familiar, high-level abstractions (e.g., data structures, batch computing); decompose them into resource proclets, granular units that each primarily consume resources of one type; split, merge, and migrate resource proclets in milliseconds.
  * GRANNY: Granular Management of Compute-Intensive Applications in the Cloud \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/segarra)]
    * ICL
* Resource Scheduling
  * GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/xu-kaiqiang)]
    * HKUST
* Serverless Computing
  * Making Serverless Pay-For-Use a Reality with Leopard \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/cao)]
    * UW-Madison
* Userspace Scheduling
  * The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/guo)]
    * UCSD

### Fault Tolerance

* One-Size-Fits-None: Understanding and Enhancing Slow Fault Tolerance in Modern Distributed Systems \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/lu)]
  * UMich & SJTU

### Memory Disaggregation

* Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/li-quanxi)]
  * UCAS & PKU & Huawei Cloud & SJTU
* Eden: Developer-Friendly Application-Integrated Far Memory \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/yelam)]
  * UCSD & Technion & VMware Research

### Real-Time Video Streaming

* Mowgli: A Passive Approach to Learning Real-Time Rate Control for Video Conferencing \[[Paper](https://www.usenix.org/conference/nsdi25/presentation/agarwal)]
  * Princeton
