# NSDI 2026

## Meta Info

Homepage: <https://www.usenix.org/conference/nsdi26>

Paper list: <https://www.usenix.org/conference/nsdi26/technical-sessions>

### Acceptance Rate

* Spring: 24.2% (= 50 / 207)
* Fall: 22.1% (= 100 / 452)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * MoE Training
    * SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/skiadopoulos)] \[[arXiv](https://arxiv.org/abs/2504.19925)]
      * Stanford & NVIDIA & OpenAI
      * Propose **SYMI**, an adaptive MoE training system that decouples the placement of expert parameters from their large optimizer states.
      * Statically partition optimizer states across training nodes and dynamically adjust expert parameter placement via the weight updates that training already performs, avoiding the overhead of frequent optimizer-state migration.
      * Improve time-to-convergence by 30.5% over DeepSpeed and 25.9% over FlexMoE.
    * Checkpoint Lite, Recover Right: Efficient Fault Tolerant Training of Mixture-of-Experts Models Using Sparse Checkpoints \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/gandhi)]
      * Stanford & NVIDIA
  * Cross-Cluster Training
    * Di-PS: System-Algorithm Co-Design for Asynchronous and Heterogeneous Cross-Cluster LLM Training at Scale \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/li-shengwei)]
      * NUDT & Shanghai AI Lab & NTU
  * Fault Tolerance
    * Attack of the Bubbles: Straggler-Resilient Pipeline Parallelism for Large Model Training \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/wu-tianyuan)] \[[arXiv](https://arxiv.org/abs/2504.19232)]
      * HKUST & Alibaba
      * Present a straggler-resilient hybrid-parallel training system that keeps pipeline-parallel large-model training efficient under communication stragglers.
      * Adapt the pipeline schedule with an analytical model to absorb slow communication without cascading bubbles, and offload delayed communication to host memory with CPU-side RDMA to avoid GPU head-of-line blocking.
      * Reduce training iteration time by 1.2x to 3.5x under various straggler settings.
    * Flare: Anomaly Diagnostics for Divergent LLM Training in GPU Clusters of Thousand-Plus Scale \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/cui)] \[[arXiv](https://arxiv.org/abs/2502.05413)]
      * SJTU & Ant Group & NUS
      * Introduce **Flare**, a diagnostic framework for distributed LLM training at scale.
      * Combine a lightweight tracing daemon for full-stack, backend-extensible tracing with a diagnostic engine that automatically diagnoses anomalies, especially performance regressions.
      * Demonstrate deployment across 6,000 GPUs with continuous operation for over eight months in production scenarios.
    * EROICA: Online Performance Troubleshooting for Large-scale Model Training in Production \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/guan-yu)] \[[arXiv](https://arxiv.org/abs/2506.08528)]
      * Alibaba
      * Present **EROICA**, an online troubleshooting system for large-scale model training that combines fine-grained profiling with full-cluster coverage.
      * Summarize runtime execution patterns through online profiling and use differential observability to localize hardware, software, and mixed root causes with minimal production impact.
      * Report deployment on production GPU clusters of about 100,000 GPUs for 1.5 years, diagnosing difficult performance issues with 97.5% success.
  * RL Post-Training
    * Flexes: Taming Long-Tail Rollouts for RL Post-Training with Tail Batching \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/gao-wei)] \[[arXiv](https://arxiv.org/abs/2509.21009)]
      * HKUST & Alibaba
      * Introduce tail batching, a rollout scheduling strategy for synchronous RL post-training that packs prompts with long-tail responses into a small number of long rounds while keeping most rounds balanced and short.
      * Combine tail batching with elastic rollout parallelism, dynamic reward-stage resource scheduling, and stream-based training to reduce rollout bubbles without relaxing synchronization.
      * Cut end-to-end training time by 2.03x to 2.56x over veRL and by up to 2.24x over RLHFuse on Qwen2.5 models.
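      * A minimal sketch of the tail-batching idea, assuming hypothetical per-prompt response-length predictions (the real system schedules online and combines this with elastic parallelism):

        ```python
        # Toy tail batching: prompts predicted to produce long-tail responses
        # share a few dedicated long rounds, so the remaining rounds stay
        # short and balanced under synchronous rollout barriers.
        def tail_batch(pred_len: dict[str, int], round_size: int, tail_threshold: int):
            short = sorted((p for p in pred_len if pred_len[p] <= tail_threshold),
                           key=pred_len.get)
            long_ = sorted((p for p in pred_len if pred_len[p] > tail_threshold),
                           key=pred_len.get)
            chunk = lambda xs: [xs[i:i + round_size] for i in range(0, len(xs), round_size)]
            return chunk(short) + chunk(long_)  # tail rounds run last

        rounds = tail_batch({f"p{i}": 4096 if i % 10 == 0 else 256 for i in range(64)},
                            round_size=8, tail_threshold=1024)
        ```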
  * Performance Modeling and Simulation
    * Supercharging Packet-Level Network Simulation of Large Model Training via Memoization and Fast-Forwarding \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/long)] \[[arXiv](https://arxiv.org/abs/2602.10615)]
      * THU & Zhongguancun Laboratory & BUPT & Huawei
      * Propose **Wormhole**, a user-transparent packet-level discrete-event simulation kernel for large-model training that reduces redundant simulation work without simplifying the network model.
      * Memoize unsteady states and fast-forward steady states through network partitioning, state reuse, and rate-based steady-state identification while preserving simulation consistency.
      * Achieve 744x speedup over ns-3 with less than 1% error; combined with multithreading, the speedup reaches 1012x.
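      * A toy illustration of rate-based fast-forwarding, using a fluid fair-share link instead of real packet events (Wormhole memoizes and fast-forwards full packet-level ns-3 state, which this sketch does not attempt):

        ```python
        # Once per-flow rates are steady, jump the clock straight to the next
        # flow completion instead of simulating every packet in between.
        def fast_forward(flows: dict[str, float], link_bw: float) -> float:
            t = 0.0
            while flows:
                rate = link_bw / len(flows)                   # steady fair share
                step = min(b / rate for b in flows.values())  # next completion
                t += step                                     # fast-forward
                for name in list(flows):
                    flows[name] -= rate * step
                    if flows[name] <= 1e-9:
                        del flows[name]
            return t

        print(fast_forward({"f1": 1e9, "f2": 5e8}, link_bw=1e10))  # ~0.15 s
        ```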
    * GPUSynth: Maximizing Code Reuse in Simulation-Based Machine Learning System Performance Estimation \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/qin)] \[[arXiv](https://arxiv.org/abs/2505.01616)]
      * Duke & UC Berkeley & Meta
      * Introduce **Phantora**, a hybrid GPU-cluster simulator for ML training performance estimation that runs unmodified ML frameworks in a distributed containerized environment.
      * Intercept GPU and communication operations during execution, enabling direct reuse of ML framework source code instead of reimplementing frameworks inside a simulator.
      * Match the accuracy of static workload simulation while supporting three state-of-the-art LLM training frameworks out of the box on a single GPU.
* LLM Inference
  * Request Scheduling
    * FastServe: Iteration-Level Preemptive Scheduling for Large Language Model Inference \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/wu-bingyang)] \[[arXiv](https://arxiv.org/abs/2305.05920)]
      * PKU
      * Propose **FastServe**, an LLM serving system that enables iteration-level preemptive scheduling for autoregressive decoding instead of request-level FIFO execution.
      * Combine a skip-join multi-level feedback queue scheduler with a parallelism-aware suspension strategy that downscales preempted models into CPU/NVMe storage.
      * Improve multi-model serving throughput and latency; the abstract reports up to 5.1x higher throughput and 13.9x to 111.8x lower latency than prior systems.
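      * A minimal sketch of a skip-join multi-level feedback queue, assuming each request's prefill time is known at arrival and a hypothetical `run_for` executor:

        ```python
        import bisect

        QUANTA = [0.01, 0.02, 0.04, 0.08, 0.16]   # per-level time slices (s)
        queues = [[] for _ in QUANTA]

        def enqueue(req, prefill_time: float):
            # Skip-join: enter the first level whose quantum covers the
            # prefill time, so long prompts never monopolize the top queue.
            level = min(bisect.bisect_left(QUANTA, prefill_time), len(QUANTA) - 1)
            queues[level].append(req)

        def schedule_once():
            for level, q in enumerate(queues):    # highest priority first
                if q:
                    req = q.pop(0)
                    done = run_for(req, QUANTA[level])   # hypothetical executor
                    if not done and level + 1 < len(queues):
                        queues[level + 1].append(req)    # demote on expiry
                    return req
            return None
        ```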
    * JITServe: SLO-aware LLM Serving with Imprecise Request Information \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/zhang-wei)] \[[arXiv](https://arxiv.org/abs/2504.20068)]
      * UIUC & NJU & Google Labs & Cisco Research
      * Propose **JITServe**, an LLM serving system for workloads where request information is imprecise at arrival time.
      * Forecast future decoding lengths from near-future token generation and construct just-in-time execution schedules that balance prediction accuracy against rapidly changing system dynamics.
      * Increase throughput while improving latency SLO satisfaction; the abstract reports 1.8x to 7.5x higher throughput and 2.2x to 8.7x better SLO attainment than prior systems.
    * Libra: Flexible Request Partitioning and Scheduling for Serving Unbalanced and Dynamic LLM Workloads \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/ruan-libra)]
      * NUS & USTC & UC Berkeley
  * KV Cache Management
    * DroidSpeak: KV Cache Sharing Across Fine-tuned Model Variants \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/liu-yuhan)] \[[arXiv](https://arxiv.org/abs/2411.02820)]
      * UChicago & Microsoft
      * Introduce **DroidSpeak**, the first distributed LLM inference system that reuses prefix KV caches across different LLMs with the same architecture, including across distributed nodes.
      * Selectively recompute a small subset of layers while reusing the remaining layers' KV cache from the other model, and pipeline recomputation with reused-cache loading to improve performance while preserving quality.
      * Improve throughput by up to 4x and prefill latency by about 3.1x with negligible quality loss.
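      * A runnable toy of one plausible layer-selection heuristic: recompute the layers whose caches diverge most between the two models and reuse the rest (the paper's actual selection criterion may differ):

        ```python
        import numpy as np

        def pick_recompute_layers(kv_a, kv_b_ref, budget: int):
            """Rank layers by mean cache divergence on a probe input and
            spend the recompute budget on the most divergent ones."""
            div = [float(np.abs(a - b).mean()) for a, b in zip(kv_a, kv_b_ref)]
            worst = sorted(range(len(div)), key=lambda l: -div[l])[:budget]
            return sorted(worst)

        rng = np.random.default_rng(0)
        kv_a = [rng.normal(size=(8, 64)) for _ in range(12)]
        kv_b = [a + (0.5 if l in (4, 5, 6) else 0.01) * rng.normal(size=a.shape)
                for l, a in enumerate(kv_a)]
        print(pick_recompute_layers(kv_a, kv_b, budget=3))   # -> [4, 5, 6]
        ```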
    * SYMPHONY: Improving Memory Management for LLM Inference Workloads \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/agarwal)] \[[arXiv](https://arxiv.org/abs/2412.16434)]
      * UT Austin & UW-Madison
      * Exploit hints from multi-turn workloads to migrate KV caches off the critical serving path, avoiding both KV cache recomputation and the machine pinning that host-memory offload causes.
      * Dynamically migrate KV caches to enable fine-grained scheduling of inference requests across the cluster.
      * Handle more than 8x as many requests as state-of-the-art baselines while preserving a similar latency profile.
  * Workload Characterization
    * Seshat: Workload Characterization and Generation of Large Language Model Serving in Production \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/xiang-servegen)] \[[arXiv](https://arxiv.org/abs/2505.09999)]
      * PKU & Alibaba
      * Characterize production LLM serving workloads from a worldwide cloud inference service, covering language, multimodal, and reasoning models.
      * Build a per-client workload generation framework that composes realistic serving workloads from the observed production characteristics.
      * Avoid 50% under-provisioning compared with naive workload generation in a production validation case.
  * Serverless Computing
    * HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/lou)] \[[arXiv](https://arxiv.org/abs/2502.15524)]
      * PKU & Alibaba Cloud
      * Present **HydraServe**, a serverless LLM serving system for public clouds that minimizes cold-start latency.
      * Proactively distribute models across servers, overlap cold-start stages within workers, place workers to avoid GPU-network contention, and consolidate pipelines to reduce cold-start resource usage.
      * Reduce cold-start latency by 1.7x to 4.7x and improve SLO attainment by 1.43x to 1.74x over baselines.
  * Multiplexing
    * FlexLLM: Token-Level Co-Serving of LLM Inference and Finetuning with SLO Guarantees \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/oliaro)] \[[arXiv](https://arxiv.org/abs/2402.18789)]
      * CMU & Purdue & Anthropic & Mistral AI & Stanford
      * Introduce **FlexLLM**, the first system to co-serve LLM inference and parameter-efficient fine-tuning on shared GPUs by fusing computation at the token level.
      * Use dependent parallelization and graph pruning to shrink activation memory, then interleave inference and training tokens with token-level finetuning and a hybrid token scheduler to meet latency SLOs.
      * Save up to 80% of GPU memory and improve finetuning throughput by 1.9x to 4.8x under heavy inference load and 2.5x to 6.8x under light load.
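      * A minimal sketch of a hybrid token scheduler: each iteration has a fixed token budget, inference tokens are admitted first, and leftover slots are backfilled with finetuning tokens (budget and numbers are illustrative):

        ```python
        def plan_iteration(infer_tokens: int, finetune_backlog: int, budget: int):
            serve = min(infer_tokens, budget)              # inference first
            train = min(finetune_backlog, budget - serve)  # backfill the rest
            return serve, train

        for infer in (300, 900, 1400):
            print(plan_iteration(infer, finetune_backlog=10_000, budget=1024))
        # -> (300, 724), (900, 124), (1024, 0)
        ```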
  * LLM Agents
    * Agentix: An Efficient Serving Engine for LLM Agents as General Programs \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/luo)] \[[arXiv](https://arxiv.org/abs/2502.13965)]
      * UC Berkeley & Google DeepMind & SJTU
      * Treat agent programs as first-class scheduling entities in LLM serving to reduce end-to-end latency for agentic workloads.
      * Intercept program-issued LLM calls to expose program-level context and preemptively prioritize calls based on previously completed work for both single-threaded and distributed programs.
      * Improve program throughput by 4x to 15x at the same latency compared with systems such as vLLM.
    * Cortex: Achieving Low-Latency, Cost-Efficient Remote Data Access For LLM via Semantic-Aware Knowledge Caching \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/ruan-cortex)] \[[arXiv](https://arxiv.org/abs/2509.17360)]
      * NUS & USTC & UofT & Sea AI Lab
      * Introduce **Cortex**, a cross-region knowledge caching architecture for LLM agents that targets semantic reuse rather than exact-match query reuse.
      * Build semantic-aware retrieval on Semantic Element and Semantic Retrieval Index abstractions, then add semantic-aware cache hits, eviction, prefetching, and a colocated lightweight LLM judger.
      * Increase throughput by up to 3.6x on search workloads while preserving accuracy close to uncached baselines, and improve coding-task throughput by 20%.
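      * A minimal sketch of semantic (rather than exact-match) cache lookup, assuming a unit-norm embedding function and a made-up similarity threshold:

        ```python
        import numpy as np

        class SemanticCache:
            def __init__(self, threshold: float = 0.9):
                self.keys, self.values, self.threshold = [], [], threshold

            def put(self, emb: np.ndarray, value):
                self.keys.append(emb / np.linalg.norm(emb))
                self.values.append(value)

            def get(self, emb: np.ndarray):
                # A hit is any cached entry whose embedding is close enough,
                # even if the query string itself was never seen before.
                if not self.keys:
                    return None
                sims = np.stack(self.keys) @ (emb / np.linalg.norm(emb))
                best = int(np.argmax(sims))
                return self.values[best] if sims[best] >= self.threshold else None
        ```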
  * MoE Inference
    * SwiftEP: Accelerating MoE Inference with Buffer Fusion and TMA Offloading \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/li-xingyi)]
      * Tencent & Nanjing University & Zhongguancun Laboratory
* LLM Fine-Tuning
  * MuxTune: Efficient Multi-Task LLM Fine-Tuning in Multi-Tenant Datacenters via Spatial-Temporal Backbone Multiplexing \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/xue-chunyu)] \[[arXiv](https://arxiv.org/abs/2603.02885)]
    * SJTU & NUS
    * Present **MuxTune**, a multi-tenant fine-tuning system that spatially and temporally multiplexes the shared backbone across concurrent parameter-efficient fine-tuning tasks.
    * Build on unified fine-tuning representations with hierarchical co-scheduling across tasks, operators, and data, including hybrid spatial-temporal multiplexing, two-tier hybrid parallelism, and chunk-based data alignment.
    * Achieve up to 2.33x higher throughput and 5.29x lower memory usage than prior baselines.
* LLM Storage
  * ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/wang-zirui)] \[[arXiv](https://arxiv.org/abs/2505.06252)] \[[Homepage](https://storageai.github.io/ZLLM/)]
    * UVA & Harvard
    * Characterize all publicly available Hugging Face LLM repositories; find structured sparse deltas within model families, cluster models into families by bitwise similarity, and identify the tensor level as the right deduplication granularity.
    * Design **BitX**, a lossless delta compression algorithm for XORed differences between fine-tuned and base models, and build **ZipLLM** to unify tensor-level deduplication with BitX compression.
    * Reduce model storage consumption by 54%, over 20% better than prior deduplication and compression approaches.
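    * A runnable toy of the XOR-delta idea behind BitX (the real algorithm is engineered well beyond this, e.g. for other dtypes and storage-system integration):

      ```python
      import zlib
      import numpy as np

      # Fine-tuning flips few bits per weight, so XOR-ing raw bit patterns
      # of fine-tuned vs. base tensors yields a highly compressible stream;
      # XOR-ing again restores the fine-tuned weights bit-exactly.
      base = np.random.default_rng(0).standard_normal(1 << 20).astype(np.float32)
      tuned = base.copy()
      tuned[:4096] += 1e-3                       # small, sparse update

      delta = zlib.compress((base.view(np.uint32) ^ tuned.view(np.uint32)).tobytes())
      print(len(delta) / base.nbytes)            # far below 1.0

      restored = (base.view(np.uint32) ^ np.frombuffer(
          zlib.decompress(delta), dtype=np.uint32)).view(np.float32)
      assert np.array_equal(restored, tuned)     # lossless
      ```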

### Distributed Training

* Checkpointing
  * Checkmate: Zero Performance Overhead Model Checkpointing via Network Gradient Replication \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/bhardwaj)] \[[arXiv](https://arxiv.org/abs/2507.13522)]
    * Tufts & MIT
    * Introduce **Checkmate**, a model checkpointing system that reuses network-replicated gradients to avoid additional network transfer and disk I/O overhead on the normal checkpointing path.
    * Use dynamically reconfigurable in-network checkpoint replica placement and a failure-resilient accelerator pipeline to preserve resilience under failures while overlapping checkpoint creation with gradient computation and propagation.
    * Report nearly 100% throughput improvement over GPU-optimized checkpointing and up to 13.7% over asynchronous checkpointing on a 32-node GPU cluster.
* Collective Communication
  * FAST: An Efficient Scheduler for All-to-All GPU Communication \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/lei-yiran)] \[[arXiv](https://arxiv.org/abs/2505.09764)]
    * CMU & MangoBoost & UW & UPenn
    * Present **FAST**, an efficient All-to-All(v) scheduler for modern ML workloads, especially MoE models on heterogeneous two-tier GPU fabrics.
    * Address workload skew through intra-server rebalancing and enforce balanced one-to-one scale-out transfers to avoid incast congestion.
    * Outperform prior schedulers on skewed workloads while reducing schedule synthesis time by orders of magnitude.
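    * A minimal sketch of incast-free scale-out rounds: decompose the inter-server all-to-all into shift permutations so every server sends to and receives from exactly one peer per round (FAST additionally rebalances skewed traffic inside each server, omitted here):

      ```python
      def scale_out_rounds(n: int):
          # Round r pairs server i with server (i + r) % n: a one-to-one
          # permutation, so no receiver sees two senders at once.
          return [[(i, (i + r) % n) for i in range(n)] for r in range(1, n)]

      for r, pairs in enumerate(scale_out_rounds(4), start=1):
          print(f"round {r}: {pairs}")
      # round 1: [(0, 1), (1, 2), (2, 3), (3, 0)] ...
      ```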
  * HeteCCL: Synthesizing Near-Optimal Collective Communication Schedules for Heterogeneous GPU Clusters \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/hei)]
    * Northeastern University & Alibaba Cloud & SIAT, CAS
  * ForestColl: Throughput-Optimal Collective Communications on Heterogeneous Network Fabrics \[[Paper](https://www.usenix.org/conference/nsdi26/presentation/zhao-liangyu)] \[[arXiv](https://arxiv.org/abs/2402.06787)]
    * UW & THU & Microsoft
    * Present **ForestColl**, a schedule generation tool that constructs broadcast and aggregation spanning trees to produce throughput-optimal collective communication schedules for arbitrary network topologies.
    * Achieve theoretical optimality with polynomial-time schedule generation while supporting both switching fabrics and direct accelerator interconnects.
    * Outperform vendor communication libraries and prior schedule generation methods on AMD MI250 and NVIDIA DGX A100 and H100 clusters, including LLM training workloads.
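    * A toy spanning-tree construction for intuition: one BFS broadcast tree per root, with different trees free to use different edges concurrently (ForestColl's trees are chosen to provably maximize throughput, which plain BFS does not guarantee):

      ```python
      from collections import deque

      def bfs_tree(adj: dict, root):
          parent, frontier = {root: None}, deque([root])
          while frontier:
              u = frontier.popleft()
              for v in adj[u]:
                  if v not in parent:
                      parent[v] = u        # child -> parent edges form the tree
                      frontier.append(v)
          return parent

      topo = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}  # toy 4-GPU fabric
      trees = {root: bfs_tree(topo, root) for root in topo}
      ```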

## Acronyms

* KV: Key-Value
* LLM: Large Language Model
* MoE: Mixture-of-Experts
* RL: Reinforcement Learning
* SLO: Service Level Objective

