# HPCA 2026

## Meta Info

Homepage: <https://conf.researchr.org/home/hpca-2026>

Paper list: <https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview>

## Papers

### LLM

* LLM training
  * AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training
    * Zhejiang Lab
* LLM inference
  * AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving \[[Paper](https://www.cs.sjtu.edu.cn/~lichao/publications/AUM_Unleashing_HPCA-2026-Wang.pdf)]
    * SJTU & Alibaba
  * GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference
    * KAIST
  * ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving
    * SJTU & Huawei Cloud & HKUST
  * Towards Resource-Efficient Serverless LLM Inference with SLINFER \[[arXiv](https://arxiv.org/abs/2507.00507)]
    * SJTU
  * LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration
    * UIUC & Seoul National University & Intel
  * PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System \[[arXiv](https://arxiv.org/abs/2412.20166)]
    * Hanyang University & SK hynix & KAIST
* Speculative decoding
  * Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems
    * HUST
* Wafer
  * WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip \[[arXiv](https://arxiv.org/abs/2512.12279)]
    * THU
  * TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips \[[arXiv](https://arxiv.org/abs/2512.14256)]
    * THU
  * FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer
    * THU
  * MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference \[[arXiv](https://arxiv.org/abs/2510.25258)]
    * THU
  * HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs
    * William & Mary
  * ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips
    * THU & Shanghai AI Lab
* Quantization
  * BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache \[[arXiv](https://arxiv.org/abs/2503.18773)]
    * Edinburgh & MSRA
  * AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization
    * Institute of Science Tokyo
* Reasoning
  * The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective \[[arXiv](https://arxiv.org/abs/2506.04301)]
    * KAIST
  * PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models
    * KAIST
  * RPU - A Reasoning Processing Unit
    * Harvard
* RAG
  * VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG \[[arXiv](https://arxiv.org/abs/2504.08930)]
    * GaTech
* VLM
  * Focus: A Streaming Concentration Architecture for Efficient Vision-Language Models \[[arXiv](https://arxiv.org/abs/2512.14661)] \[[Code](https://github.com/dubcyfor3/Focus)]
    * Duke
* Video LLM
  * V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval \[[arXiv](https://arxiv.org/abs/2512.12284)]
    * KAIST
* Misc
  * Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems
    * SJTU & Huawei
  * RoMe: Row Granularity Access Memory System for Large Language Models \[[arXiv](https://arxiv.org/abs/2512.01541)]
    * Seoul National University & Meta
  * LEGO: Supporting LLM-enhanced Games with One Gaming GPU
    * SJTU & Tongji University

### GPU

* UVM
  * ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription \[[Artifact](https://zenodo.org/records/17852674)]
    * Yonsei University & DGIST
* Chiplet
  * COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators
    * NUDT & PKU
  * Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem
    * NUDT
  * LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture
    * SYSU
* Sparsity
  * Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs
    * Hunan University
  * Uni-STC: Unified Sparse Tensor Core
    * CUP-Beijing & NUDT
* Misc
  * QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs
    * University of Murcia & William & Mary & NVIDIA
  * μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs
    * TJU
  * FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive Operators via Inter-Core Connection \[[arXiv](https://arxiv.org/abs/2512.12949)]
    * SJTU

### VAR

* VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy
  * HKUST

## Acronyms

* LLM: Large Language Model
* VLM: Vision-Language Model
* RAG: Retrieval-Augmented Generation
* UVM: Unified Virtual Memory
* VAR: Visual Autoregressive Modeling

---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available on this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://paper.lingyunyang.com/reading-notes/conference/hpca-2026.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question, along with relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
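
As a concrete illustration, here is a minimal Python sketch of such a query using only the standard library. The question string is hypothetical, and the endpoint is assumed to return a plain-text answer as described above:

```python
# Minimal sketch of the `ask` query mechanism described above.
# Assumptions: the endpoint returns a plain-text answer, and the
# example question below is hypothetical.
from urllib.parse import quote
from urllib.request import urlopen

PAGE_URL = "https://paper.lingyunyang.com/reading-notes/conference/hpca-2026.md"
question = "Which HPCA 2026 papers study wafer-scale chips?"

# URL-encode the question and pass it via the `ask` query parameter.
with urlopen(f"{PAGE_URL}?ask={quote(question)}") as response:
    print(response.read().decode("utf-8"))
```

Note that the question must be URL-encoded (handled by `quote` above), since natural-language questions typically contain spaces and punctuation.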
