# NSDI 2026

## Meta Info

Homepage: <https://www.usenix.org/conference/nsdi26>

Paper list: <https://www.usenix.org/conference/nsdi26/technical-sessions>

### Acceptance Rate

* Spring: 24.2% (= 50 / 207)
* Fall: 22.1% (= 100 / 452)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * MoE Training
    * SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/skiadopoulos)] \[[arXiv](https://arxiv.org/abs/2504.19925)]
      * Stanford & NVIDIA & OpenAI
      * Propose **SYMI**, an adaptive MoE training system that decouples the placement of expert parameters from their large optimizer states.
      * Statically partition optimizer states across training nodes while dynamically adjusting expert parameter placement using existing weight updates, avoiding frequent state migration overheads.
      * Improve time-to-convergence by 30.5% over DeepSpeed and 25.9% over FlexMoE.
    * Checkpoint Lite, Recover Right: Efficient Fault Tolerant Training of Mixture-of-Experts Models Using Sparse Checkpoints \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/gandhi)]
      * Stanford & NVIDIA
  * Cross-Cluster Training
    * Di-PS: System-Algorithm Co-Design for Asynchronous and Heterogeneous Cross-Cluster LLM Training at Scale \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/li-shengwei)]
      * NUDT & Shanghai AI Lab & NTU
  * Fault Tolerance
    * Attack of the Bubbles: Straggler-Resilient Pipeline Parallelism for Large Model Training \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/wu-tianyuan)] \[[arXiv](https://arxiv.org/abs/2504.19232)]
      * HKUST & Alibaba
      * Present a straggler-resilient hybrid-parallel training system for pipeline-parallel large-model training under communication stragglers.
      * Adapt the pipeline schedule with an analytical model to absorb slow communication without cascading bubbles, and offload delayed communication to host memory with CPU-side RDMA to avoid GPU head-of-line blocking.
      * Reduce training iteration time by 1.2x to 3.5x under various straggler settings.
    * Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/cui)] \[[arXiv](https://arxiv.org/abs/2502.05413)]
      * SJTU & Ant Group & NUS
      * Introduce **Flare**, a diagnostic framework for distributed LLM training at scale.
      * Combine a lightweight tracing daemon for full-stack, backend-extensible tracing with a diagnostic engine that automatically diagnoses anomalies, especially performance regressions.
      * Demonstrate deployment across 6,000 GPUs with continuous operation for over eight months in production scenarios.
    * EROICA: Online Performance Troubleshooting for Large-scale Model Training in Production \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/guan-yu)] \[[arXiv](https://arxiv.org/abs/2506.08528)]
      * Alibaba
      * Present **EROICA**, an online troubleshooting system for large-scale model training that combines fine-grained profiling with full-cluster coverage.
      * Summarize runtime execution patterns through online profiling and use differential observability to localize hardware, software, and mixed root causes with minimal production impact.
      * Report deployment on production GPU clusters of about 100,000 GPUs for 1.5 years, diagnosing difficult performance issues with 97.5% success.
  * RL Post-Training
    * Flexes: Taming Long-Tail Rollouts for RL Post-Training with Tail Batching \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/gao-wei)] \[[arXiv](https://arxiv.org/abs/2509.21009)]
      * HKUST & Alibaba
      * Introduce tail batching, a rollout scheduling strategy for synchronous RL post-training that packs prompts with long-tail responses into a small number of long rounds while keeping most rounds balanced and short.
      * Combine tail batching with elastic rollout parallelism, dynamic reward-stage resource scheduling, and stream-based training to reduce rollout bubbles without relaxing synchronization.
      * Cut end-to-end training time by 2.03x to 2.56x over veRL and by up to 2.24x over RLHFuse on Qwen2.5 models.
  * Performance Modeling and Simulation
    * Supercharging Packet-Level Network Simulation of Large Model Training via Memoization and Fast-Forwarding \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/long)] \[[arXiv](https://arxiv.org/abs/2602.10615)]
      * THU & Zhongguancun Laboratory & BUPT & Huawei
      * Propose **Wormhole**, a user-transparent packet-level discrete-event simulation kernel for large-model training that reduces redundant simulation work without simplifying the network model.
      * Memoize unsteady states and fast-forward steady states through network partitioning, state reuse, and rate-based steady-state identification while preserving simulation consistency.
      * Achieve 744x speedup over ns-3 with less than 1% error; combined with multithreading, the speedup reaches 1012x.
    * GPUSynth: Maximizing Code Reuse in Simulation-Based Machine Learning System Performance Estimation \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/qin)] \[[arXiv](https://arxiv.org/abs/2505.01616)]
      * Duke & UC Berkeley & Meta
      * Introduce **Phantora**, a hybrid GPU-cluster simulator for ML training performance estimation that runs unmodified ML frameworks in a distributed containerized environment.
      * Intercept GPU and communication operations during execution, enabling direct reuse of ML framework source code instead of reimplementing frameworks inside a simulator.
      * Match the accuracy of static workload simulation while supporting three state-of-the-art LLM training frameworks out of the box on a single GPU.
* LLM Inference
  * Request Scheduling
    * FastServe: Iteration-Level Preemptive Scheduling for Large Language Model Inference \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/wu-bingyang)] \[[arXiv](https://arxiv.org/abs/2305.05920)]
      * PKU
      * Propose **FastServe**, an LLM serving system that enables iteration-level preemptive scheduling for autoregressive decoding instead of request-level FIFO execution.
      * Combine a skip-join multi-level feedback queue scheduler with a proactive key-value cache management mechanism that offloads the state of preempted requests between GPU and host memory.
      * Improve serving throughput and latency; the abstract reports up to 5.1x higher throughput and 13.9x to 111.8x lower latency than prior systems.
    * JITServe: SLO-aware LLM Serving with Imprecise Request Information \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/zhang-wei)] \[[arXiv](https://arxiv.org/abs/2504.20068)]
      * UIUC & NJU & Google Labs & Cisco Research
      * Propose **JITServe**, an LLM serving system for workloads where request information is imprecise at arrival time.
      * Forecast future decoding lengths from near-future token generation and construct just-in-time execution schedules that balance prediction accuracy against rapidly changing system dynamics.
      * Increase throughput while improving latency SLO satisfaction; the abstract reports 1.8x to 7.5x higher throughput and 2.2x to 8.7x better SLO attainment than prior systems.
    * Libra: Flexible Request Partitioning and Scheduling for Serving Unbalanced and Dynamic LLM Workloads \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/ruan-libra)]
      * NUS & USTC & UC Berkeley
  * KV Cache Management
    * DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/liu-yuhan)] \[[arXiv](https://arxiv.org/abs/2411.02820)]
      * UChicago & Microsoft
      * Introduce **DroidSpeak**, the first distributed LLM inference system that reuses prefix KV caches across different LLMs with the same architecture, including across distributed nodes.
      * Selectively recompute a small subset of layers from another model's KV cache and reuse the remaining layers, then pipeline recomputation with reused-cache loading to improve performance while preserving quality.
      * Improve throughput by up to 4x and prefill latency by about 3.1x with negligible quality loss.
    * SYMPHONY: Improving Memory Management for LLM Inference Workloads \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/agarwal)] \[[arXiv](https://arxiv.org/abs/2412.16434)]
      * UT Austin & UW-Madison
      * Exploit hints from multi-turn workloads to migrate KV caches off the critical serving path, instead of recomputing them or pinning requests to specific machines via host-memory offload.
      * Dynamically migrate KV caches to enable fine-grained scheduling of inference requests across the cluster.
      * Handle more than 8x as many requests as state-of-the-art baselines while preserving a similar latency profile.
  * Workload Characterization
    * Seshat: Workload Characterization and Generation of Large Language Model Serving in Production \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/xiang-servegen)] \[[arXiv](https://arxiv.org/abs/2505.09999)]
      * PKU & Alibaba
      * Characterize production LLM serving workloads from a worldwide cloud inference service, covering language, multimodal, and reasoning models.
      * Build a per-client workload generation framework that composes realistic serving workloads from the observed production characteristics.
      * Avoid 50% under-provisioning compared with naive workload generation in a production validation case.
  * Serverless Computing
    * HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/lou)] \[[arXiv](https://arxiv.org/abs/2502.15524)]
      * PKU & Alibaba Cloud
      * Present **HydraServe**, a serverless LLM serving system for public clouds that minimizes cold-start latency.
      * Proactively distribute models across servers, overlap cold-start stages within workers, place workers to avoid GPU-network contention, and consolidate pipelines to reduce cold-start resource usage.
      * Reduce cold-start latency by 1.7x to 4.7x and improve SLO attainment by 1.43x to 1.74x over baselines.
  * Multiplexing
    * FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/oliaro)] \[[arXiv](https://arxiv.org/abs/2402.18789)]
      * CMU & Purdue & Anthropic & Mistral AI & Stanford
      * Introduce **FlexLLM**, the first system to co-serve LLM inference and parameter-efficient fine-tuning on shared GPUs by fusing computation at the token level.
      * Use dependent parallelization and graph pruning to shrink activation memory, then interleave inference and training tokens with token-level finetuning and a hybrid token scheduler to meet latency SLOs.
      * Save up to 80% of GPU memory and improve finetuning throughput by 1.9x to 4.8x under heavy inference load and 2.5x to 6.8x under light load.
  * LLM Agents
    * Agentix: An Efficient Serving Engine for LLM Agents as General Programs \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/luo)] \[[arXiv](https://arxiv.org/abs/2502.13965)]
      * UC Berkeley & Google DeepMind & SJTU
      * Treat agent programs as first-class scheduling entities in LLM serving to reduce end-to-end latency for agentic workloads.
      * Intercept program-issued LLM calls to expose program-level context and preemptively prioritize calls based on previously completed work for both single-threaded and distributed programs.
      * Improve program throughput by 4x to 15x at the same latency compared with systems such as vLLM.
    * Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/ruan-cortex)] \[[arXiv](https://arxiv.org/abs/2509.17360)]
      * NUS & USTC & UofT & Sea AI Lab
      * Introduce **Cortex**, a cross-region knowledge caching architecture for LLM agents that targets semantic reuse rather than exact-match query reuse.
      * Build semantic-aware retrieval on Semantic Element and Semantic Retrieval Index abstractions, then add semantic-aware cache hits, eviction, prefetching, and a colocated lightweight LLM judger.
      * Increase throughput by up to 3.6x on search workloads while preserving accuracy close to uncached baselines, and improve coding-task throughput by 20%.
  * MoE Inference
    * SwiftEP: Accelerating MoE Inference with Buffer Fusion and TMA Offloading \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/li-xingyi)]
      * Tencent & Nanjing University & Zhongguancun Laboratory
* LLM Fine-Tuning
  * MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/xue-chunyu)] \[[arXiv](https://arxiv.org/abs/2603.02885)]
    * SJTU & NUS
    * Present **MuxTune**, a multi-tenant fine-tuning system that spatially and temporally multiplexes the shared backbone across concurrent parameter-efficient fine-tuning tasks.
    * Build on unified fine-tuning representations with hierarchical co-scheduling across tasks, operators, and data, including hybrid spatial-temporal multiplexing, two-tier hybrid parallelism, and chunk-based data alignment.
    * Achieve up to 2.33x higher throughput and 5.29x lower memory usage than prior baselines.
* LLM Storage
  * ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/wang-zirui)] \[[arXiv](https://arxiv.org/abs/2505.06252)] \[[Homepage](https://storageai.github.io/ZLLM/)]
    * UVA & Harvard
    * Characterize all publicly available Hugging Face LLM repositories, identifying structured sparse deltas within model families, bitwise similarity as the basis for family clustering, and the tensor level as the right deduplication granularity.
    * Design **BitX**, a lossless delta compression algorithm for XORed differences between fine-tuned and base models, and build **ZipLLM** to unify tensor-level deduplication with BitX compression.
    * Reduce model storage consumption by 54%, over 20% better than prior deduplication and compression approaches.
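
The bit-level intuition behind BitX above can be sketched in a few lines: if fine-tuning perturbs mostly low-order mantissa bits, the XOR of the fine-tuned and base weight blobs is dominated by zero bits and compresses well even with a generic compressor. Everything below (function names, the `zlib` stage, the toy weights) is illustrative, not from the paper.

```python
import struct
import zlib

def xor_delta_compress(base: bytes, finetuned: bytes) -> bytes:
    """Store a fine-tuned weight blob as a compressed XOR delta against its base."""
    assert len(base) == len(finetuned)
    # Bitwise XOR of the raw weight bytes: weights that barely moved during
    # fine-tuning produce long runs of zero bits, which zlib compresses well.
    delta = bytes(a ^ b for a, b in zip(base, finetuned))
    return zlib.compress(delta)

def xor_delta_decompress(base: bytes, blob: bytes) -> bytes:
    """Losslessly reconstruct the fine-tuned blob from the base and the delta."""
    delta = zlib.decompress(blob)
    return bytes(a ^ b for a, b in zip(base, delta))

# Toy example: 1,024 float32 "weights", of which fine-tuning touched only 10.
base_weights = [0.001 * i for i in range(1024)]
tuned_weights = list(base_weights)
for i in range(10):
    tuned_weights[i * 100] += 1e-4

base_blob = struct.pack("1024f", *base_weights)
tuned_blob = struct.pack("1024f", *tuned_weights)

compressed = xor_delta_compress(base_blob, tuned_blob)
restored = xor_delta_decompress(base_blob, compressed)
assert restored == tuned_blob             # lossless round trip
assert len(compressed) < len(tuned_blob)  # the delta is far smaller than the weights
```

BitX pairs the XOR delta with a compressor specialized to the float exponent/mantissa bit layout; plain `zlib` here is a stand-in for that stage.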

### Distributed Training

* Checkpointing
  * Checkmate: Zero Performance Overhead Model Checkpointing via Network Gradient Replication \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/bhardwaj)] \[[arXiv](https://arxiv.org/abs/2507.13522)]
    * Tufts & MIT
    * Introduce **Checkmate**, a model checkpointing system that reuses network-replicated gradients to avoid additional network transfer and disk I/O overhead on the normal checkpointing path.
    * Use dynamically reconfigurable in-network checkpoint replica placement and a failure-resilient accelerator pipeline to preserve resilience under failures while overlapping checkpoint creation with gradient computation and propagation.
    * Report nearly 100% throughput improvement over GPU-optimized checkpointing and up to 13.7% over asynchronous checkpointing on a 32-node GPU cluster.
* Collective Communication
  * FAST: An Efficient Scheduler for All-to-All GPU Communication \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/lei-yiran)] \[[arXiv](https://arxiv.org/abs/2505.09764)]
    * CMU & MangoBoost & UW & UPenn
    * Present **FAST**, an efficient All-to-All(v) scheduler for modern ML workloads, especially MoE models on heterogeneous two-tier GPU fabrics.
    * Address workload skew through intra-server rebalancing and enforce balanced one-to-one scale-out transfers to avoid incast congestion.
    * Outperform prior schedulers on skewed workloads while reducing schedule synthesis time by orders of magnitude.
  * HeteCCL: Synthesizing Near-Optimal Collective Communication Schedules for Heterogeneous GPU Clusters \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/hei)]
    * Northeastern University & Alibaba Cloud & SIAT, CAS
  * ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/zhao-liangyu)] \[[arXiv](https://arxiv.org/abs/2402.06787)]
    * UW & THU & Microsoft
    * Present **ForestColl**, a schedule generation tool that constructs broadcast and aggregation spanning trees to produce throughput-optimal collective communication schedules for arbitrary network topologies.
    * Achieve theoretical optimality with polynomial-time schedule generation while supporting both switching fabrics and direct accelerator interconnects.
    * Outperform vendor communication libraries and prior schedule generation methods on AMD MI250 and NVIDIA DGX A100 and H100 clusters, including LLM training workloads.
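
As a minimal illustration of the "balanced one-to-one scale-out transfers" that FAST enforces, consider the classic round-robin permutation schedule: in round r, server i sends only to server (i + r) mod n, so every round is a perfect matching and no receiver sees more than one incoming flow (no incast). This is a textbook baseline under that framing, not FAST's synthesis algorithm, which additionally rebalances skewed traffic within each server.

```python
def one_to_one_rounds(n_servers: int) -> list[list[tuple[int, int]]]:
    """All-to-all as n-1 rounds of perfect matchings (round-robin shifts)."""
    return [
        [(src, (src + r) % n_servers) for src in range(n_servers)]
        for r in range(1, n_servers)
    ]

rounds = one_to_one_rounds(4)
for matching in rounds:
    # Each round: every server sends exactly once and receives exactly once.
    receivers = [dst for _, dst in matching]
    assert sorted(receivers) == [0, 1, 2, 3]

# Across all rounds, every ordered (src, dst) pair with src != dst appears once,
# so the schedule realizes a complete all-to-all exchange.
pairs = {p for matching in rounds for p in matching}
assert len(pairs) == 4 * 3
```

With skewed All-to-All(v) payloads, equal-sized rounds no longer exist for raw data, which is why FAST first rebalances within each server before emitting matchings like these.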

## Acronyms

* KV: Key-Value
* LLM: Large Language Model
* MoE: Mixture-of-Experts
* RL: Reinforcement Learning
* SLO: Service Level Objective
