# HPCA 2026

## Meta Info

Homepage: <https://conf.researchr.org/home/hpca-2026>

Paper list: <https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview>

## Papers

### LLM

* LLM training
  * AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
    * Zhejiang Lab
* LLM inference
  * AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving \[[Paper](https://www.cs.sjtu.edu.cn/~lichao/publications/AUM_Unleashing_HPCA-2026-Wang.pdf)]
    * SJTU & Alibaba
  * GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference
    * KAIST
  * ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
    * SJTU & Huawei Cloud & HKUST
  * Towards Resource-Efficient Serverless LLM Inference with SLINFER \[[arXiv](https://arxiv.org/abs/2507.00507)]
    * SJTU
  * LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
    * UIUC & Seoul National University & Intel
  * PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System \[[arXiv](https://arxiv.org/abs/2412.20166)]
    * Hanyang University & SK hynix & KAIST
* Speculative decoding
  * Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
    * HUST
* Wafer
  * WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip \[[arXiv](https://arxiv.org/abs/2512.12279)]
    * THU
  * TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips \[[arXiv](https://arxiv.org/abs/2512.14256)]
    * THU
  * FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
    * THU
  * MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference \[[arXiv](https://arxiv.org/abs/2510.25258)]
    * THU
  * HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
    * William & Mary
  * ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
    * THU & Shanghai AI Lab
* Quantization
  * BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache \[[arXiv](https://arxiv.org/abs/2503.18773)]
    * Edinburgh & MSRA
  * AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
    * Institute of Science Tokyo
* Reasoning
  * The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective \[[arXiv](https://arxiv.org/abs/2506.04301)]
    * KAIST
  * PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
    * KAIST
  * RPU - A Reasoning Processing Unit
    * Harvard
* RAG
  * VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG \[[arXiv](https://arxiv.org/abs/2504.08930)]
    * GaTech
* VLM
  * Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models \[[arXiv](https://arxiv.org/abs/2512.14661)] \[[Code](https://github.com/dubcyfor3/Focus)]
    * Duke
* Video LLM
  * V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval \[[arXiv](https://arxiv.org/abs/2512.12284)]
    * KAIST
* Misc
  * Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
    * SJTU & Huawei
  * RoMe: Row Granularity Access Memory System for Large Language Models \[[arXiv](https://arxiv.org/abs/2512.01541)]
    * Seoul National University & Meta
  * LEGO: Supporting LLM-enhanced Games with One Gaming GPU
    * SJTU & Tongji University

### GPU

* UVM
  * ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription \[[Artifact](https://zenodo.org/records/17852674)]
    * Yonsei University & DGIST
* Chiplet
  * COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators
    * NUDT & PKU
  * Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem
    * NUDT
  * LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture
    * SYSU
* Sparsity
  * Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
    * Hunan University
  * Uni-STC: Unified Sparse Tensor Core
    * CUP-Beijing & NUDT
* Misc
  * QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
    * University of Murcia & William & Mary & NVIDIA
  * μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
    * TJU
  * FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive operators via Inter-Core Connection \[[arXiv](https://arxiv.org/abs/2512.12949)]
    * SJTU

### VAR

* VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy
  * HKUST

## Acronyms

* LLM: Large Language Model
* VLM: Vision-Language Model
* RAG: Retrieval-Augmented Generation
* KV: Key-Value
* LoRA: Low-Rank Adaptation
* PIM: Processing-In-Memory
* MCM: Multi-Chip Module
* UVM: Unified Virtual Memory
* VAR: Visual AutoRegressive modeling
