# HPCA 2026

## Meta Info

Homepage: <https://conf.researchr.org/home/hpca-2026>

Paper list: <https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview>

## Papers

### LLM

* LLM training
  * AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
    * Zhejiang Lab
* LLM inference
  * AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving \[[Paper](https://www.cs.sjtu.edu.cn/~lichao/publications/AUM_Unleashing_HPCA-2026-Wang.pdf)]
    * SJTU & Alibaba
  * GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference
    * KAIST
  * ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
    * SJTU & Huawei Cloud & HKUST
  * Towards Resource-Efficient Serverless LLM Inference with SLINFER \[[arXiv](https://arxiv.org/abs/2507.00507)]
    * SJTU
  * LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
    * UIUC & Seoul National University & Intel
  * PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System \[[arXiv](https://arxiv.org/abs/2412.20166)]
    * Hanyang University & SK hynix & KAIST
* Speculative decoding
  * Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
    * HUST
* Wafer
  * WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip \[[arXiv](https://arxiv.org/abs/2512.12279)]
    * THU
  * TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips \[[arXiv](https://arxiv.org/abs/2512.14256)]
    * THU
  * FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
    * THU
  * MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference \[[arXiv](https://arxiv.org/abs/2510.25258)]
    * THU
  * HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
    * William & Mary
  * ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
    * THU & Shanghai AI Lab
* Quantization
  * BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache \[[arXiv](https://arxiv.org/abs/2503.18773)]
    * Edinburgh & MSRA
  * AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
    * Institute of Science Tokyo
* Reasoning
  * The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective \[[arXiv](https://arxiv.org/abs/2506.04301)]
    * KAIST
  * PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
    * KAIST
  * RPU - A Reasoning Processing Unit
    * Harvard
* RAG
  * VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG \[[arXiv](https://arxiv.org/abs/2504.08930)]
    * GaTech
* VLM
  * Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models \[[arXiv](https://arxiv.org/abs/2512.14661)] \[[Code](https://github.com/dubcyfor3/Focus)]
    * Duke
* Video LLM
  * V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval \[[arXiv](https://arxiv.org/abs/2512.12284)]
    * KAIST
* Misc
  * Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
    * SJTU & Huawei
  * RoMe: Row Granularity Access Memory System for Large Language Models \[[arXiv](https://arxiv.org/abs/2512.01541)]
    * Seoul National University & Meta
  * LEGO: Supporting LLM-enhanced Games with One Gaming GPU
    * SJTU & Tongji University

### GPU

* UVM
  * ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription \[[Artifact](https://zenodo.org/records/17852674)]
    * Yonsei University & DGIST
* Chiplet
  * COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators
    * NUDT & PKU
  * Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem
    * NUDT
  * LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture
    * SYSU
* Sparsity
  * Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
    * Hunan University
  * Uni-STC: Unified Sparse Tensor Core
    * CUP-Beijing & NUDT
* Misc
  * QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
    * University of Murcia & William & Mary & NVIDIA
  * μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
    * TJU
  * FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection \[[arXiv](https://arxiv.org/abs/2512.12949)]
    * SJTU

### VAR

* VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy
  * HKUST

## Acronyms

* LLM: Large Language Model
* VLM: Vision-Language Model
* RAG: Retrieval-Augmented Generation
* UVM: Unified Virtual Memory
* VAR: Visual Autoregressive Modeling

---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://paper.lingyunyang.com/reading-notes/conference/hpca-2026.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question, along with relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
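
As a concrete illustration, here is a minimal Python sketch of such a query using only the standard library. The question string is hypothetical, and the endpoint is assumed to return a plain-text answer as described above:

```python
# Minimal sketch of the `ask` query mechanism described above.
# Assumptions: the endpoint returns a plain-text answer, and the
# example question below is hypothetical.
from urllib.parse import quote
from urllib.request import urlopen

PAGE_URL = "https://paper.lingyunyang.com/reading-notes/conference/hpca-2026.md"
question = "Which HPCA 2026 papers study wafer-scale chips?"

# URL-encode the question and pass it via the `ask` query parameter.
with urlopen(f"{PAGE_URL}?ask={quote(question)}") as response:
    print(response.read().decode("utf-8"))
```

Note that the question must be URL-encoded (handled by `quote` above), since natural-language questions typically contain spaces and punctuation.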
