NSDI 2025
Homepage:
Paper list:
Acceptance rate (total): 12.5% (= 83 / 666)
Fall: 13.7% (= 55 / 401)
Spring: 10.6% (= 28 / 265)
LLM Training
Minder: Faulty Machine Detection for Large-scale Distributed Model Training []
THU & ByteDance & NEU & Harvard
Automatically and efficiently detect faulty machines based on distinctive monitoring metric patterns.
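A minimal sketch of this kind of peer-comparison detection (illustrative only, not Minder's actual pipeline; the metric layout and threshold are assumptions): summarize each machine's monitoring metrics into a vector and flag the machine whose pattern deviates most from its peers.

```python
import numpy as np

def detect_faulty_machine(metrics: np.ndarray, threshold: float = 3.0):
    """Flag the machine whose metric pattern deviates most from its peers.

    metrics: shape (num_machines, num_metrics), e.g., per-machine averages of
    GPU utilization, network throughput, and memory usage over a window.
    Returns the index of the suspected machine, or None if nothing stands out.
    (Toy example; the paper's real detection is more sophisticated.)
    """
    # Normalize each metric column so different units are comparable.
    z = (metrics - metrics.mean(axis=0)) / (metrics.std(axis=0) + 1e-9)

    # Score each machine by how far it sits from the per-metric peer median.
    deviation = np.linalg.norm(z - np.median(z, axis=0), axis=1)

    suspect = int(np.argmax(deviation))
    return suspect if deviation[suspect] > threshold else None

# Example: 4 machines x 3 metrics; machine 2 has abnormally low throughput.
metrics = np.array([
    [0.95, 10.0, 60.0],
    [0.94, 10.2, 61.0],
    [0.90,  2.1, 59.0],  # the outlier
    [0.96,  9.9, 60.5],
])
print(detect_faulty_machine(metrics))  # -> 2
```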
Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters []
FDU & Tencent & UChicago
Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production []
Alibaba Cloud
Accelerating Design Space Exploration for LLM Training Systems with Multi-experiment Parallel Simulation []
THU & Zhongguancun Lab & UPenn
SimAI: Unifying Architecture Design and Performance Tunning for Large-Scale Large Language Model Training with Scalability and Precision []
Alibaba Cloud
Reinforcement Learning from Human Feedback (RLHF)
Optimizing RLHF Training for Large Language Models with Stage Fusion [] []
PKU & StepFun
Checkpointing
BCP: A Unified Checkpointing System for Large Foundation Model Development []
HKU & ByteDance
GPU-Disaggregated Serving for Deep Learning Recommendation Models at Scale []
HKUST & Alibaba
SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads []
GaTech & UC Berkeley & Adobe
AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training []
USTC & Microsoft
OptiReduce: Resilient and Tail-Optimal AllReduce for Distributed Deep Learning in the Cloud []
Purdue & NVIDIA & VMware Research & Feldera
Efficient Direct-Connect Topologies for Collective Communications []
UW & Raytheon BBN Technologies & MIT
Remote Direct Memory Access (RDMA)
White-Boxing RDMA with Packet-Granular Software Control []
UW & UW-Madison
Mitigating Scalability Walls of RDMA-based Container Networks []
Alibaba Cloud
Application Networks
High-level Programming for Application Networks []
UW & Duke
Container Overlay Network
ONCache: A Cache-Based Low-Overhead Container Overlay Network []
SJTU & Broadcom
Placement
Preventing Network Bottlenecks: Accelerating Datacenter Services with Hotspot-Aware Placement for Compute and Storage []
Google & USC & Harvard & UCLA & Columbia
Network Mitigation
Enhancing Network Failure Mitigation with Performance-Aware Ranking []
USC & Microsoft
Granular Management
Quicksand: Harnessing Stranded Datacenter Resources with Granular Computing []
MIT & Brown & USC & VMware Research
Provide developers with familiar, high-level abstractions (e.g., data structures, batch computing); decompose them into resource proclets, granular units that each primarily consume resources of one type; split, merge, and migrate resource proclets in milliseconds.
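A toy sketch of the resource-proclet idea (purely hypothetical code, not the Quicksand API; all class and method names are made up): a familiar dict abstraction is decomposed into small memory-only proclets, each of which can be split or migrated independently of the application.

```python
from dataclasses import dataclass, field

@dataclass
class Proclet:
    """A granular unit that primarily consumes one resource type."""
    resource_type: str                       # e.g., "memory" or "compute"
    node: str = "node-0"                     # server currently hosting it
    data: dict = field(default_factory=dict)

    def split(self):
        """Split this proclet's shard into two smaller proclets."""
        keys = sorted(self.data)
        half = len(keys) // 2
        return (Proclet(self.resource_type, self.node, {k: self.data[k] for k in keys[:half]}),
                Proclet(self.resource_type, self.node, {k: self.data[k] for k in keys[half:]}))

    def migrate(self, target_node: str):
        """Move this proclet to another server (metadata-only in this toy)."""
        self.node = target_node

class ShardedDict:
    """Familiar high-level abstraction (a dict) backed by memory proclets."""
    def __init__(self, num_shards: int = 4):
        self.shards = [Proclet("memory") for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def __setitem__(self, key, value):
        self._shard(key).data[key] = value

    def __getitem__(self, key):
        return self._shard(key).data[key]

# The application keeps using a plain dict interface, while the runtime is
# free to rebalance individual proclets onto stranded memory elsewhere.
d = ShardedDict()
d["alice"], d["bob"] = 1, 2
d.shards[0].migrate("node-7")
print(d["alice"], d["bob"])  # -> 1 2
```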
GRANNY: Granular Management of Compute-Intensive Applications in the Cloud []
ICL
Resource Scheduling
GREEN: Carbon-efficient Resource Scheduling for Machine Learning Clusters []
HKUST
Serverless Computing
Making Serverless Pay-For-Use a Reality with Leopard []
UW-Madison
Userspace Scheduling
The Benefits and Limitations of User Interrupts for Preemptive Userspace Scheduling []
UCSD
One-Size-Fits-None: Understanding and Enhancing Slow Fault Tolerance in Modern Distributed Systems []
UMich & SJTU
Beehive: A Scalable Disaggregated Memory Runtime Exploiting Asynchrony of Multithreaded Programs []
UCAS & PKU & Huawei Cloud & SJTU
Eden: Developer-Friendly Application-Integrated Far Memory []
UCSD & Technion & VMware Research
Mowgli: A Passive Approach to Learning Real-Time Rate Control for Video Conferencing []
Princeton