# ASPLOS 2026

## Meta Info

Homepage: <https://www.asplos-conference.org/asplos2026/>

### Paper List

* Program: <https://www.asplos-conference.org/asplos2026/program/>
* Proceedings Volume 1: <https://dl.acm.org/doi/proceedings/10.1145/3760250>
* Proceedings Volume 2: <https://dl.acm.org/doi/proceedings/10.1145/3779212>

### Acceptance Rate

* Review model: ASPLOS 2026 used two submission cycles (`Spring` and `Summer`) and retained a `Major Revision` path for selected papers.
* Spring Cycle: 9.6% (= 20 / 208)
  * Major Revision: 9.1% (= 19 / 208)
* Summer Cycle: 15.7% (= 132 / 840)
* Total: 14.5% (= 152 / 1048)
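
A quick sanity check of the arithmetic above, as a minimal Python sketch; the accepted/submitted counts are taken directly from the bullets:

```python
# Accepted / submitted counts as listed above.
spring_accepted, spring_submitted = 20, 208
summer_accepted, summer_submitted = 132, 840

def rate(accepted: int, submitted: int) -> float:
    """Acceptance rate as a percentage."""
    return 100 * accepted / submitted

print(f"Spring: {rate(spring_accepted, spring_submitted):.1f}%")  # 9.6%
print(f"Summer: {rate(summer_accepted, summer_submitted):.1f}%")  # 15.7%
print(f"Total:  {rate(spring_accepted + summer_accepted, spring_submitted + summer_submitted):.1f}%")  # 14.5%
```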

## Papers

### Large Language Models (LLMs)

* LLM Inference
  * Prefill-Decode Multiplexing
    * Towards High-Goodput LLM Serving with Prefill-decode Multiplexing \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790236)] \[[arXiv](https://arxiv.org/abs/2504.14489)]
      * SJTU & HKU & NUS
      * Propose **MuxWise**, an LLM serving framework built on intra-GPU prefill-decode multiplexing.
      * Integrate a bubble-less multiplex engine, a contention-tolerant estimator, and an SLO-aware dispatcher.
    * Bullet: Boosting GPU Utilization for LLM Serving via Dynamic Spatial-Temporal Orchestration \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790135)] \[[arXiv](https://arxiv.org/abs/2504.19516)] \[[Code](https://github.com/zejia-lin/Bullet)]
      * SYSU
      * Enable concurrent execution of prefill and decode requests.
      * Dynamically provision GPU resources based on real-time performance modeling.
    * TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill & Decode Inference \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790237)]
      * PKU & Tencent YouTu Lab
      * Introduce tensor-parallel latent attention for disaggregated prefill/decode inference.
      * Combine latent attention with tensor parallelism to improve PD-disaggregated long-context serving.
  * Scheduling
    * QoServe: Breaking the Silos of LLM Inference Serving \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790206)] \[[arXiv](https://arxiv.org/abs/2503.22562)]
      * MSR India
      * Introduce fine-grained QoS classification so applications can specify precise latency requirements, and adapt scheduling decisions to real-time system state.
      * Leverage the predictable execution characteristics of LLM inference to implement dynamic chunking for higher throughput under strict QoS guarantees.
      * Combine hybrid prioritization with selective request relegation to balance fairness, efficiency, and graceful degradation under overload.
    * Shift Parallelism: Low-Latency, High-Throughput LLM Inference for Dynamic Workloads \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790219)] \[[arXiv](https://arxiv.org/abs/2509.16495)]
      * Snowflake
      * Introduce **Shift Parallelism**, a runtime that switches across inference parallelism strategies for dynamic workloads.
      * Turn parallelism selection into a runtime control decision to jointly improve latency and throughput.
    * XY-Serve: End-to-End Versatile Production Serving for Dynamic LLM Workloads \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762228)]
      * Huawei & THU & Shanghai AI Lab
      * Present **XY-Serve**, an end-to-end serving system for dynamic production LLM workloads.
      * Coordinate scheduling, batching, and runtime resource management to sustain serving efficiency under workload variation.
    * BlendServe: Optimizing Offline Inference with Resource-Aware Batching \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790133)]
      * UC Berkeley & UW & UC Davis & Rice
      * Present a resource-aware batching framework for offline inference.
      * Form batches against actual compute and memory bottlenecks to improve throughput.
  * MoE Inference
    * MoE-APEX: An Efficient MoE Inference System with Adaptive Precision Expert Offloading \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790187)]
      * SJTU & CUHK
      * Introduce an MoE inference system with adaptive-precision expert offloading.
      * Jointly tune expert offloading and precision to reduce memory pressure during serving.
  * Compression
    * ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790250)] \[[arXiv](https://arxiv.org/abs/2603.17435)] \[[Code](https://github.com/xxyux/ZipServ)]
      * HKUST-GZ & HIT-SZ & HKUST
      * Introduce hardware-aware lossless compression for LLM inference.
      * Reduce memory footprint while preserving exact model behavior and improving serving efficiency.
  * Speculative Decoding (a toy draft-then-verify loop is sketched after this list)
    * DFVG: A Heterogeneous Architecture for Speculative Decoding with Draft-on-FPGA and Verify-on-GPU \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790153)]
      * SJTU & Eastern Institute of Technology, Ningbo & Southeast University & Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo
      * Propose a heterogeneous speculative decoding architecture with FPGA draft generation and GPU verification.
      * Pipeline draft and verify across devices to reduce end-to-end decoding latency.
    * SwiftSpec: Disaggregated Speculative Decoding and Fused Kernels for Low-Latency LLM Inference \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790246)]
      * ByteDance Seed & UChicago
      * Introduce disaggregated speculative decoding together with fused kernels for low-latency LLM inference.
      * Combine system-level disaggregation and kernel-level optimization to make speculative decoding practical in deployment.
  * Sparsity
    * SpeContext: Enabling Efficient Long-context Reasoning with Speculative Context Sparsity in LLMs \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790224)]
      * SJTU & Infinigence-AI & SII & THU
      * Introduce speculative context sparsity for long-context reasoning in LLMs.
      * Avoid uniform full-context processing by speculating over sparse context usage during long-input inference.
  * Attention Mechanisms
    * I/O Analysis is All You Need: An I/O Analysis for Long-Sequence Attention \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790174)]
      * IIT & ICT, CAS & UCAS
      * Present an I/O-centric analysis framework for long-sequence attention.
      * Show that data movement, rather than FLOPs alone, dominates long-context attention cost.
    * PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790200)]
      * TJU & Stevens Institute of Technology
      * Introduce prefix-aware attention together with a multi-tile kernel for LLM decoding.
      * Reduce decode latency by exploiting shared prefixes while keeping GPU resource usage under control.
  * Value Level Parallelism (VLP)
    * Mugi: Value Level Parallelism For Efficient LLMs \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790189)]
      * CMU & UCF
      * Introduce value-level parallelism as a new execution dimension for LLM inference.
      * Exploit finer-grained parallel structure than conventional tensor or sequence parallelism.
  * KV Cache Offloading
    * REPA: Reconfigurable PIM for the Joint Acceleration of KV Cache Offloading and Processing \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790212)]
      * SJTU
      * Present a reconfigurable PIM architecture for jointly offloading and processing KV cache.
      * Co-design KV movement and KV computation to reduce host-memory bottlenecks during inference.
    * STARC: Selective Token Access with Remapping and Clustering for Efficient LLM Decoding on PIM Systems \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790226)]
      * RPI & UMass Amherst & IBM Research
      * Introduce selective token access with remapping and clustering for PIM-based LLM decoding.
      * Reduce unnecessary KV accesses and improve data locality during decoding.
* LLM Training
  * RL Post-Training
    * History Doesn't Repeat Itself but Rollouts Rhyme: Accelerating Reinforcement Learning with RhymeRL \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790172)]
      * SJTU & ByteDance
      * Present **RhymeRL**, a framework that accelerates RL by exploiting reusable structure across rollout histories.
      * Reduce redundant rollout work to improve training efficiency for LLM-aligned RL workloads.
  * Multimodal Model Training
    * DIP: Efficient Large Multimodal Model Training with Dynamic Interleaved Pipeline \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790154)]
      * SJTU & StepFun & Zenergize AI
      * Introduce a dynamic interleaved pipeline for large multimodal model training.
      * Increase pipeline utilization by interleaving stages dynamically across modalities and training phases.
  * Mixed-Precision Training
    * SNIP: An Adaptive Mixed Precision Framework for Subbyte Large Language Model Training \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790223)] \[[arXiv](https://arxiv.org/abs/2602.01410)]
      * UMich & Meta & UMass Amherst
      * Present an adaptive mixed-precision framework for subbyte LLM training.
      * Periodically profile training statistics and solve a precision-allocation problem to assign fine-grained bitwidths.
  * Diagnosis
    * Fine-grained and Non-intrusive LLM Training Monitoring via Microsecond-level Traffic Measurement \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790163)]
      * NJU & NUS & Infrawaves
      * Introduce microsecond-level traffic measurement for fine-grained, non-intrusive LLM training monitoring.
      * Infer communication and runtime behavior without intrusive application instrumentation.
  * Offloading
    * SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762217)] \[[arXiv](https://arxiv.org/abs/2509.21271)]
      * UIUC & Microsoft & Snowflake
      * Revisit large-scale LLM training on tightly coupled GPU-CPU superchips with **SuperOffload**.
      * Combine adaptive weight offloading with superchip-aware runtime optimizations to improve long-context training throughput.
* Language Processing Units (LPUs)
  * Hardwired-Neuron Language Processing Units as General-Purpose Cognitive Substrates \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790169)]
    * ICT, CAS & USTC & IS, CAS & Cambricon Technologies
    * Propose Language Processing Units (LPUs) as a language-centric hardware substrate for general-purpose cognitive workloads.
    * Specialize the architecture around language processing primitives to improve efficiency on language-centric tasks.
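
Several of the inference entries above (DFVG and SwiftSpec under Speculative Decoding) build on the standard draft-then-verify loop. The sketch below is a minimal, model-agnostic illustration of that loop rather than any specific paper's system; `draft_model` and `target_model` are hypothetical callables that return a greedy next token, whereas real systems work on full distributions and verify a whole draft in one batched forward pass.

```python
from typing import Callable, List

# Hypothetical interfaces: given a token sequence, return the greedy next token.
DraftModel = Callable[[List[int]], int]
TargetModel = Callable[[List[int]], int]

def speculative_decode(prompt: List[int],
                       draft_model: DraftModel,
                       target_model: TargetModel,
                       gamma: int = 4,
                       max_new_tokens: int = 64) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft: the cheap model proposes `gamma` tokens autoregressively.
        draft: List[int] = []
        for _ in range(gamma):
            draft.append(draft_model(tokens + draft))

        # 2. Verify: the expensive model checks each proposal in order.
        for i in range(gamma):
            target_tok = target_model(tokens + draft[:i])
            if target_tok != draft[i]:
                # First mismatch: keep the target model's token and stop.
                tokens.extend(draft[:i] + [target_tok])
                break
        else:
            # All drafts accepted; the target model contributes one bonus token.
            tokens.extend(draft)
            tokens.append(target_model(tokens))
    return tokens[: len(prompt) + max_new_tokens]
```

The draft phase is sequential but cheap, while the verify phase is one large forward pass; that asymmetry is what makes splitting the two across devices (as in DFVG) or disaggregating them (as in SwiftSpec) attractive.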

### Generative Recommenders (GRs)

* GR Serving
  * BAT: Efficient Generative Recommender Serving with Bipartite Attention \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790131)]
    * ZJU & HKU & Alibaba & NUS & Aalto University
    * Introduce bipartite attention for generative recommender serving.
    * Tailor the serving design to recommendation-style generative workloads rather than generic LLM inference.

### Diffusion Models

* Video DiT Training
  * DSV: Exploiting Dynamic Sparsity to Accelerate Large-Scale Video DiT Training \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762216)] \[[arXiv](https://arxiv.org/abs/2502.07590)]
    * CUHK & StepFun
    * Exploit dynamic sparsity to accelerate large-scale video DiT training.
    * Use hybrid sparsity-aware context parallelism to rebalance workloads under heterogeneous attention sparsity.
* Diffusion Model Serving
  * TetriServe: Efficiently Serving Mixed DiT Workloads \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790233)] \[[arXiv](https://arxiv.org/abs/2602.05116)] \[[Code](https://github.com/DiT-Serving/TetriServe)]
    * UMich & UW-Madison & NTU
    * Present a serving system for mixed DiT workloads.
    * Coordinate scheduling and batching across heterogeneous diffusion requests in a shared runtime.
* Mixture-of-Diffusion Models
  * MoDM: Efficient Serving for Image Generation via Mixture-of-Diffusion Models \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762220)] \[[Code](https://github.com/stsxxx/MoDM)]
    * UMich & Intel Labs
    * Introduce mixture-of-diffusion models for image generation serving.
    * Use specialization across diffusion sub-models to improve efficiency and quality-cost tradeoffs.
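
As a rough illustration of what "specialization across diffusion sub-models" can mean in serving, the sketch below routes each prompt to a cheap sub-model first and escalates to an expensive one only when an estimated quality score is too low. This is a generic cascade pattern under assumed interfaces (`GenerateFn`, `ScoreFn`, and the `0.28` threshold are all made up for the example), not a description of MoDM's actual policy.

```python
from dataclasses import dataclass
from typing import Any, Callable

GenerateFn = Callable[[str], Any]        # prompt -> image
ScoreFn = Callable[[str, Any], float]    # (prompt, image) -> quality estimate

@dataclass
class CascadeRouter:
    small_model: GenerateFn   # cheap diffusion sub-model
    large_model: GenerateFn   # expensive, higher-quality sub-model
    score: ScoreFn            # e.g., a CLIP-style prompt-image similarity
    threshold: float = 0.28   # escalate when estimated quality is below this

    def generate(self, prompt: str) -> Any:
        image = self.small_model(prompt)
        if self.score(prompt, image) >= self.threshold:
            return image                 # good enough: skip the large model
        return self.large_model(prompt)  # escalate only the hard prompts
```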

### Deep Learning Training

* T-Control: An Efficient Dynamic Tensor Rematerialization System for DNN Training \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790230)]
  * ICT, CAS
  * Present a dynamic tensor rematerialization system for DNN training.
  * Adjust rematerialization online to balance memory savings and recomputation overhead (a static PyTorch counterpart of this tradeoff is sketched after this list).
* NotebookOS: A Replicated Notebook Platform for Interactive Training with On-Demand GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762230)]
  * GMU & Adobe Research & UVA
  * Present a replicated notebook platform for interactive model training with on-demand GPUs.
  * Combine notebook-centric workflow support with elastic GPU provisioning.
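
T-Control (above) makes rematerialization decisions dynamically at runtime. For reference, the static form of the same memory-for-recompute tradeoff is what PyTorch exposes as activation checkpointing; the minimal sketch below uses the stock `torch.utils.checkpoint` API and is not T-Control's mechanism.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Drop each block's activations in the forward pass and recompute them
    during backward, trading extra compute for lower peak memory."""

    def __init__(self, dim: int = 1024, depth: int = 8):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Static policy: rematerialize every block. A dynamic system such as
            # T-Control instead decides at runtime which tensors to drop.
            x = checkpoint(block, x, use_reentrant=False)
        return x

model = CheckpointedMLP()
out = model(torch.randn(4, 1024, requires_grad=True))
out.sum().backward()
```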

### Deep Learning Compilation

* FuseFlow: A Fusion-Centric Compilation Framework for Sparse Deep Learning on Streaming Dataflow \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790165)]
  * Stanford & SambaNova Systems & Barcelona Supercomputing Center
  * Present a fusion-centric compilation framework for sparse deep learning on streaming dataflow hardware.
  * Expand fusion opportunities for sparse operators to improve accelerator execution efficiency.
* Trinity: Three-Dimensional Tensor Program Optimization via Tile-level Equality Saturation \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790240)]
  * KAIST & FuriosaAI
  * Introduce tile-level equality saturation for three-dimensional tensor program optimization.
  * Use equivalence-based search to discover better accelerator-friendly tensor rewrites.
* RedFuser: An Automatic Operator Fusion Framework for Cascaded Reductions on AI Accelerators \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790209)]
  * Alibaba Cloud
  * Present an automatic operator fusion framework for cascaded reductions on AI accelerators.
  * Target reduction-heavy operator chains that are poorly handled by existing compiler passes (a hand-fused reduction example follows this list).
* Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using F2 \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762221)]
  * GMU & OpenAI
  * Introduce linear-layout abstractions for tensor code generation.
  * Improve portability and performance by reducing reliance on brittle layout-specific code generation.
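
RedFuser (above) targets chains of reductions that standard fusion passes handle poorly. The NumPy sketch below shows the pattern by hand on softmax, whose max and sum reductions can be fused into a single streaming pass (the classic "online softmax" rewrite); it illustrates the kind of transformation such a compiler automates, not RedFuser's algorithm.

```python
import numpy as np

def softmax_unfused(x: np.ndarray) -> np.ndarray:
    """Separate passes: one reduction for the max, one for the sum of exps."""
    m = x.max()
    e = np.exp(x - m)
    return e / e.sum()

def softmax_fused(x: np.ndarray) -> np.ndarray:
    """One streaming pass carrying a running max and a rescaled running sum,
    fusing the cascaded max and sum reductions."""
    m, s = -np.inf, 0.0
    for v in x:
        m_new = max(m, v)
        s = s * np.exp(m - m_new) + np.exp(v - m_new)
        m = m_new
    return np.exp(x - m) / s

x = np.random.randn(1024)
assert np.allclose(softmax_unfused(x), softmax_fused(x))
```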

### GPU Systems

* GPU Scheduling
  * gShare: Efficient GPU Sharing with Aggressive Scheduling in Multi-tenant FaaS Platform \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790168)]
    * China Telecom Cloud Computing Research Institute & China Telecom Cloud Technology Co. Ltd.
    * Present an aggressive GPU sharing and scheduling framework for multi-tenant FaaS platforms.
    * Improve utilization through fine-grained temporal multiplexing across serverless tenants.
  * GFS: A Preemption-aware Scheduling Framework for GPU Clusters with Predictive Spot Instance Management \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762231)]
    * SJTU & ZJU & Alibaba
    * Present a preemption-aware scheduling framework for GPU clusters with predictive spot instance management.
    * Jointly schedule jobs and volatile spot capacity to reduce disruption and improve cluster efficiency.
* GPU Communication
  * MSCCL++: Rethinking GPU Communication Abstractions for AI Inference \[[Paper](https://dl.acm.org/doi/10.1145/3779212.3790188)] \[[Code](https://github.com/microsoft/mscclpp)]
    * MSR & Microsoft Azure
    * Present a new GPU communication abstraction stack tailored to AI inference.
    * Move beyond training-centric collective abstractions to better support inference communication patterns.
* GPU Programming
  * cuJSON: A Highly Parallel JSON Parser for GPUs \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762222)]
    * UC Riverside
    * Introduce a highly parallel JSON parser for GPUs.
    * Make JSON parsing a scalable GPU primitive for preprocessing and data-serving pipelines.
  * CHERI-SIMT: Implementing Capability Memory Protection in GPGPUs \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762234)]
    * Cambridge
    * Implement capability-based memory protection in GPGPUs with CHERI-SIMT.
    * Bring stronger spatial memory safety and isolation to SIMT execution.
  * Lobster: A GPU-Accelerated Framework for Neurosymbolic Programming \[[Paper](https://dl.acm.org/doi/10.1145/3760250.3762232)]
    * UPenn
    * Present a GPU-accelerated framework for neurosymbolic programming.
    * Provide systems support for workloads that combine symbolic and neural computation.

### Profiling

* Deep Learning Profiling
  * DeepContext: A Context-aware, Cross-platform, and Cross-framework Tool for Performance Profiling and Analysis of Deep Learning Workloads \[[Paper](https://dl.acm.org/doi/10.1145/3676642.3736127)]
    * NCSU & GMU
    * Introduce a context-aware profiling and analysis tool for deep learning workloads across platforms and frameworks.
    * Use execution context to explain performance behavior beyond isolated kernel-level statistics.
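
DeepContext's premise is that kernel statistics only become actionable when tied to execution context. As a rough point of comparison, the PyTorch sketch below attaches context labels by hand with `torch.profiler`'s `record_function` so kernel time can be grouped under user-level phases; it uses stock PyTorch APIs and recovers far less context than the tool described above.

```python
import torch
from torch import nn
from torch.profiler import ProfilerActivity, profile, record_function

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(64, 512)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    # Manual context labels; DeepContext's point is to recover this kind of
    # cross-platform, cross-framework context without hand annotation.
    with record_function("phase::forward"):
        out = model(x)
    with record_function("phase::loss_and_backward"):
        out.sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```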

## Acronyms

* DiT: Diffusion Transformer
* DNN: Deep Neural Network
* FaaS: Function-as-a-Service
* GR: Generative Recommender
* KV: Key-Value
* LLM: Large Language Model
* LPU: Language Processing Unit
* MoE: Mixture-of-Experts
* PD: Prefill-Decode
* PIM: Processing-in-Memory
* QoS: Quality of Service
* RL: Reinforcement Learning
* SLO: Service-Level Objective

