HPCA 2026
Meta Info
Homepage: https://conf.researchr.org/home/hpca-2026
Paper list: https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview
Papers
LLM
LLM training
AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
Zhejiang Lab
LLM inference
AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving [Paper]
SJTU & Alibaba
GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference
KAIST
ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
SJTU & Huawei Cloud & HKUST
Towards Resource-Efficient Serverless LLM Inference with SLINFER [arXiv]
SJTU
LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
UIUC & Seoul National University & Intel
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System [arXiv]
Hanyang University & SK hynix & KAIST
Speculative decoding
Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
HUST
Wafer
WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip [arXiv]
THU
TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips [arXiv]
THU
FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
THU
MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference [arXiv]
THU
HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
William & Mary
ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
THU & Shanghai AI Lab
Quantization
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache [arXiv]
Edinburgh & MSRA
AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
Institute of Science Tokyo
Reasoning
The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective [arXiv]
KAIST
PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
KAIST
RPU - A Reasoning Processing Unit
Harvard
RAG
VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG [arXiv]
GaTech
Video LLM
V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval [arXiv]
KAIST
Misc
Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
SJTU & Huawei
RoMe: Row Granularity Access Memory System for Large Language Models [arXiv]
Seoul National University & Meta
LEGO: Supporting LLM-enhanced Games with One Gaming GPU
SJTU & Tongji University
GPU
UVM
ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription [Artifact]
Yonsei University & DGIST
Chiplet
COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators
NUDT & PKU
Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem
NUDT
LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture
SYSU
Sparsity
Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
Hunan University
Uni-STC: Unified Sparse Tensor Core
CUP-Beijing & NUDT
Misc
QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
University of Murcia & William & Mary & NVIDIA
μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
TJU
FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection [arXiv]
SJTU
VAR
VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy
HKUST
Acronyms
LLM: Large Language Model
VLM: Vision-Language Model
RAG: Retrieval-Augmented Generation
UVM: Unified Virtual Memory
VAR: Visual AutoRegressive modeling