
HPCA 2026

Meta Info

Homepage: https://conf.researchr.org/home/hpca-2026

Paper list: https://2026.hpca-conf.org/track/hpca-2026-main-conference#event-overview

Papers

LLM

  • LLM training

    • AutoHAAP: Automated Heterogeneity-Aware Asymmetric Partitioning for LLM Training

      • Zhejiang Lab

  • LLM inference

    • AUM: Unleashing the Efficiency Potential of Shared Processors with Accelerator Units for LLM Serving [Paper]

      • SJTU & Alibaba

    • GyRot: Leveraging Hidden Synergy between Rotation and Fine-grained Group Quantization for Low-bit LLM Inference

      • KAIST

    • ELORA: Efficient LoRA and KV Cache Management for Multi-LoRA LLM Serving

      • SJTU & Huawei Cloud & HKUST

    • Towards Resource-Efficient Serverless LLM Inference with SLINFER [arXiv]

      • SJTU

    • LILo: Harnessing the On-chip Accelerators in Intel CPUs for Compressed LLM Inference Acceleration

      • UIUC & Seoul National University & Intel

    • PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System [arXiv]

      • Hanyang University & SK hynix & KAIST

  • Speculative decoding

    • Adaptive Draft Sequence Length: Enhancing Speculative Decoding Throughput on PIM-Enabled Systems

      • HUST

  • Wafer

    • WATOS: Efficient LLM Training Strategies and Architecture Co-exploration for Wafer-scale Chip [arXiv]

      • THU

    • TEMP: A Memory Efficient Physical-aware Tensor Partition-Mapping Framework on Wafer-scale Chips [arXiv]

      • THU

    • FACE: Fully PD Overlapped Scheduling and Multi-Level Architecture Co-Exploration on Wafer

      • THU

    • MoEntwine: Unleashing the Potential of Wafer-scale Chips for Large-scale Expert Parallel Inference [arXiv]

      • THU

    • HDPAT: Hierarchical Distributed Page Address Translation for Wafer-Scale GPUs

      • William & Mary

    • ReThermal: Co-Design of Thermal-Aware Static and Dynamic Scheduling for LLM Training on Liquid-Cooled Wafer-Scale Chips

      • THU & Shanghai AI Lab

  • Quantization

    • BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache [arXiv]

      • Edinburgh & MSRA

    • AQPIM: Breaking the PIM Capacity Wall for LLMs with In-Memory Activation Quantization

      • Institute of Science Tokyo

  • Reasoning

    • The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective [arXiv]

      • KAIST

    • PASCAL: A Phase-Aware Scheduling Algorithm for Serving Reasoning-based Large Language Models

      • KAIST

    • RPU - A Reasoning Processing Unit

      • Harvard

  • RAG

    • VectorLiteRAG: Latency-Aware and Fine-Grained Resource Partitioning for Efficient RAG [arXiv]

      • GaTech

  • VLM

  • Video LLM

    • V-Rex: Real-Time Streaming Video LLM Acceleration via Dynamic KV Cache Retrieval [arXiv]

      • KAIST

  • Misc

    • Towards Compute-Aware In-Switch Computing for LLMs Tensor-Parallelism on Multi-GPU Systems

      • SJTU & Huawei

    • RoMe: Row Granularity Access Memory System for Large Language Models [arXiv]

      • Seoul National University & Meta

    • LEGO: Supporting LLM-enhanced Games with One Gaming GPU

      • SJTU & Tongji University

GPU

  • UVM

    • ARIADNE: Adaptive UVM Management for Efficient GPU Memory Oversubscription [Artifact]

      • Yonsei University & DGIST

  • Chiplet

    • COMET: Communication and Memory Co-Design for Fine-Grained AI Inference in MCM Accelerators

      • NUDT & PKU

    • Deadlock-Free Bridge Module for Inter-Chiplet Communication in Open Chiplet Ecosystem

      • NUDT

    • LRM-GPU: Alleviating Synchronization Overhead for Multi-Chiplet GPU Architecture

      • SYSU

  • Sparsity

    • Swift: High-Performance Sparse-Dense Matrix Multiplication on GPUs

      • Hunan University

    • Uni-STC: Unified Sparse Tensor Core

      • CUP-Beijing & NUDT

  • Misc

    • QuCo: Efficient and Flexible Hardware-Driven Automatic Configuration of Tile Transfers in GPUs

      • University of Murcia & William&Mary & NVIDIA

    • μShare: Non-Intrusive Kernel Co-Locating on NVIDIA GPUs

      • TJU

    • FlashFuser: Expanding the Scale of Kernel Fusion for Compute-Intensive operators via Inter-Core Connection [arXiv]

      • SJTU

VAR

  • VAR-Turbo: Unlocking the Potential of Visual Autoregressive Models through Dual Redundancy

    • HKUST

Acronyms

  • LLM: Large Language Model

  • VLM: Vision-Language Model

  • RAG: Retrieval-Augmented Generation

  • UVM: Unified Virtual Memory

  • VAR: Visual AutoRegressive modeling
