HPCA 2025
Meta Info
Homepage: https://hpca-conf.org/2025/
Paper list: https://hpca-conf.org/2025/main-program/
Acceptance Rate
21% (= 112 / 534)
Papers
Large Language Models (LLMs)
LLM Compression
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models [arXiv]
Apple
Compress LLMs to fit into storage-limited devices.
Propose a memory-efficient DKM (Differentiable KMeans Clustering) implementation (the basic DKM step is sketched below).
Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.
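The core DKM step (independent of the paper's memory-efficient eDKM implementation) can be sketched as a soft, attention-like assignment of weights to centroids, which keeps the clustering differentiable end to end; the shapes and temperature below are illustrative assumptions.

```python
import torch

def dkm_soft_cluster(weights, centroids, temperature=1e-2):
    """Differentiable k-means step: softly assign each weight to the centroids.

    weights:   (N,) flattened weight tensor
    centroids: (K,) learnable cluster centers
    Returns soft-clustered weights (N,), differentiable w.r.t. both inputs.
    """
    # Pairwise distances between every weight and every centroid: (N, K)
    dist = (weights.unsqueeze(1) - centroids.unsqueeze(0)).abs()
    # Soft assignment via softmax over negative distances (attention-like)
    attn = torch.softmax(-dist / temperature, dim=1)
    # Each weight becomes an attention-weighted mix of centroids
    return attn @ centroids

# Toy usage: cluster a small weight vector into 4 centroids (2-bit palette)
w = torch.randn(1024, requires_grad=True)
c = torch.nn.Parameter(torch.linspace(-1.0, 1.0, 4))
w_clustered = dkm_soft_cluster(w, c)
loss = (w_clustered ** 2).mean()   # stand-in for the task loss
loss.backward()                    # gradients flow to both w and c
```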
LLM Quantization
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [arXiv] [Code]
Cornell & MSR & ICL
Algorithm: Fine-grained data type adaptation that assigns a (possibly different) numerical data type to each group of (e.g., 128) weights (see the sketch below).
Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.
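A minimal sketch of the per-group datatype selection idea, using two stand-in 4-bit datatypes (a uniform INT4 grid and an FP4 E2M1 grid) rather than BitMoD's actual formats: quantize each group with every candidate and keep the one with the lowest reconstruction error.

```python
import numpy as np

# Illustrative 4-bit candidate datatypes, each defined by its representable values.
_fp4_pos = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
DATATYPES = {
    "int4": np.arange(-8, 8, dtype=np.float32),
    "fp4_e2m1": np.concatenate([-_fp4_pos[1:], _fp4_pos]),
}

def quantize_group(group, grid):
    """Scale the group onto the grid, snap to the nearest level, rescale back."""
    scale = np.abs(group).max() / np.abs(grid).max()
    scale = scale if scale > 0 else 1.0
    idx = np.abs(group[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

def quantize_per_group(weights, group_size=128):
    """Pick, for each group of weights, the datatype with the smallest MSE."""
    out = np.empty_like(weights)
    chosen = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        best = min(
            ((name, quantize_group(group, grid)) for name, grid in DATATYPES.items()),
            key=lambda item: np.mean((item[1] - group) ** 2),
        )
        chosen.append(best[0])
        out[start:start + group_size] = best[1]
    return out, chosen

w = np.random.randn(512).astype(np.float32)
w_q, types = quantize_per_group(w)
print(types)  # e.g. ['fp4_e2m1', 'int4', ...]
```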
MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type [arXiv]
SJTU
Assign the appropriate data type for each group adaptively.
Propose an efficient real-time quantization mechanism.
Implement a specific processing element to efficiently support MANT and incorporate a real-time quantization unit.
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format [arXiv]
NJU & MICAS KU Leuven
Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.
Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.
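A minimal sketch of a group-shared-exponent activation format: one exponent per group (derived from the group maximum) plus a low-bit signed mantissa per element. The fixed `mantissa_bits` argument stands in for Anda's dynamic per-group bit allocation.

```python
import numpy as np

def quantize_shared_exponent(acts, group_size=16, mantissa_bits=4):
    """Quantize activations with one shared exponent per group.

    Each group stores a single exponent plus a low-bit signed mantissa per
    element; `mantissa_bits` is fixed here, unlike Anda's dynamic allocation.
    """
    out = np.empty_like(acts)
    for start in range(0, len(acts), group_size):
        group = acts[start:start + group_size]
        max_mag = np.abs(group).max()
        # Shared exponent: smallest power of two covering the group's range
        exp = int(np.ceil(np.log2(max_mag))) if max_mag > 0 else 0
        levels = 2 ** (mantissa_bits - 1)          # e.g. 4 bits -> [-8, 7]
        step = 2.0 ** exp / levels
        mantissa = np.clip(np.round(group / step), -levels, levels - 1)
        out[start:start + group_size] = mantissa * step
    return out

a = np.random.randn(64).astype(np.float32) * 10
a_q = quantize_shared_exponent(a)
print(np.abs(a - a_q).max())  # small per-group quantization error
```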
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
SJTU
Energy-Efficient LLM Inference
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency [arXiv]
UIUC & Microsoft Azure
Given the current load and available resources, select the energy-optimized configuration (e.g., different model parallelisms, GPU frequencies).
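A minimal sketch of the selection step, with an invented table of profiled (parallelism, frequency) configurations standing in for the system's online profiles: among the configurations that can sustain the current load, pick the one with the lowest energy per token.

```python
from dataclasses import dataclass

@dataclass
class Config:
    tensor_parallel: int      # GPUs per model replica
    gpu_freq_mhz: int         # GPU core frequency
    throughput_tps: float     # profiled tokens/s at this setting
    power_w: float            # profiled power at this setting

# Hypothetical profiled configurations (numbers are invented for illustration).
CONFIGS = [
    Config(2, 1980, 900.0, 1400.0),
    Config(2, 1400, 700.0, 950.0),
    Config(4, 1980, 1700.0, 2800.0),
    Config(4, 1400, 1300.0, 1900.0),
]

def pick_config(load_tps: float) -> Config:
    """Return the lowest-energy-per-token config that can sustain the load."""
    feasible = [c for c in CONFIGS if c.throughput_tps >= load_tps]
    if not feasible:
        # Load exceeds every config: fall back to the highest-throughput one.
        return max(CONFIGS, key=lambda c: c.throughput_tps)
    return min(feasible, key=lambda c: c.power_w / c.throughput_tps)

print(pick_config(load_tps=650.0))   # low load -> small parallelism, low freq
print(pick_config(load_tps=1500.0))  # high load -> larger parallelism
```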
throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
National Technical University of Athens
Long-Context LLM Inference
InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [arXiv]
PKU
Offload the decoding-phase attention computation and its KV cache to Computational Storage Drives (CSDs).
Hardware-Assisted LLM Inference
LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
ICT, CAS
PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
Samsung SDS
FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
Seoul National University
Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
THU
Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM [arXiv]
ICT, CAS
Hermes: the proposed system, which augments GPU memory with NDP-DIMMs for affordable LLM inference.
Diffusion Models
EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models [arXiv]
KAIST
Inter-iteration sparsity → an FFN-Reuse algorithm to identify and skip redundant computations in FFN layers across different iterations (see the sketch below)
Intra-iteration sparsity → a modified eager prediction method to accurately predict the attention score, skipping unnecessary computations within an iteration
A dedicated hardware architecture to support the sparsity-inducing algorithms.
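A sketch of the general inter-iteration reuse idea only; EXION's actual algorithm exploits output sparsity, whereas here a simple input-delta threshold decides which FFN rows to recompute, and the FFN, shapes, and threshold are illustrative.

```python
import numpy as np

class FFNReuse:
    """Recompute an FFN only for rows whose input changed since the last iteration."""

    def __init__(self, ffn, threshold=1e-2):
        self.ffn = ffn                  # callable: (rows, d) -> (rows, d)
        self.threshold = threshold
        self.prev_in = None
        self.prev_out = None

    def __call__(self, x):
        if self.prev_in is None:
            out = self.ffn(x)           # first iteration: full computation
        else:
            # Rows whose input barely changed reuse the cached output.
            delta = np.abs(x - self.prev_in).max(axis=1)
            stale = delta > self.threshold
            out = self.prev_out.copy()
            if stale.any():
                out[stale] = self.ffn(x[stale])
        self.prev_in, self.prev_out = x.copy(), out
        return out

# Toy FFN and two "denoising iterations" with mostly similar inputs
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)).astype(np.float32)
ffn = FFNReuse(lambda rows: np.maximum(rows @ W, 0.0))

x0 = rng.standard_normal((8, 16)).astype(np.float32)
x1 = x0.copy()
x1[3] += 0.5                    # only one row changes between iterations
_ = ffn(x0)
_ = ffn(x1)                     # recomputes just the changed row
```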
Ditto: Accelerating Diffusion Model via Temporal Value Similarity [arXiv]
Yonsei University
Ditto: a difference processing algorithm
Leverage temporal similarity with quantization.
Perform full bit-width operations for the initial time step and process subsequent steps with temporal differences (sketched below).
Design the Ditto hardware → a specialized hardware accelerator
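A minimal sketch of difference processing with assumed bit-widths: the first step is kept at high precision, and each later step adds a low-bit quantized delta on top of the previous reconstruction, exploiting the small changes between adjacent diffusion steps.

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization to a given bit-width."""
    levels = 2 ** (bits - 1) - 1
    max_mag = np.abs(x).max()
    scale = max_mag / levels if max_mag > 0 else 1.0
    return np.round(x / scale) * scale

def difference_processing(steps, full_bits=16, delta_bits=4):
    """Process a sequence of per-step activations via quantized deltas.

    The first step is kept at (near-)full precision; each later step is
    reconstructed as previous + quantize(current - previous).
    """
    recon = [quantize(steps[0], full_bits)]
    for x in steps[1:]:
        delta = x - recon[-1]
        recon.append(recon[-1] + quantize(delta, delta_bits))
    return recon

# Toy trajectory: each step is a small perturbation of the previous one
rng = np.random.default_rng(1)
x = rng.standard_normal(1024).astype(np.float32)
trajectory = []
for _ in range(10):
    x = x + 0.05 * rng.standard_normal(1024).astype(np.float32)
    trajectory.append(x)

recon = difference_processing(trajectory)
print(max(np.abs(r - t).max() for r, t in zip(recon, trajectory)))
```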
Deep Learning Recommendation Models (DLRMs)
Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
UC Merced & Meta
Dynamic Neural Networks
Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
THU
ML Cluster Reliability
Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization [arXiv]
Alibaba
Revisiting Reliability in Large-Scale Machine Learning Research Clusters [arXiv]
Meta
ML for Systems
The Importance of Generalizability in Machine Learning for Systems
MIT & Google
ML Benchmark
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI [arXiv]
Multi-GPU Systems
OASIS: Object-Aware Page Management for Multi-GPU Systems [Paper]
Pittsburgh & NVIDIA & Ghent
Collective Communication
TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology
KAIST
PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
KAIST & Northeastern University & Boston University
Interconnect
Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck [Code]
HKUST-GZ & Intel & UCSD & HKUST
EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer
FDU
Compute Express Link (CXL)
SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design [arXiv]
UIUC
Opportunistic context switches upon the detection of long access delays.
Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.
Adaptive page migration to promote hot pages in the CXL-SSD to the host (a minimal promotion policy is sketched below).
Implemented and evaluated in a CXL-SSD simulator.
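A minimal sketch of an access-count-based promotion policy in the spirit of the adaptive page migration above; the threshold, capacity, and data structures are assumptions, not SkyByte's actual mechanism, and eviction is omitted.

```python
from collections import Counter

class HotPagePromoter:
    """Promote frequently accessed CXL-SSD pages to host DRAM."""

    def __init__(self, hot_threshold=8, host_capacity_pages=4):
        self.access_counts = Counter()
        self.hot_threshold = hot_threshold
        self.host_capacity = host_capacity_pages
        self.pages_in_host = set()

    def on_access(self, page_id):
        if page_id in self.pages_in_host:
            return "host_dram"                    # already promoted: fast path
        self.access_counts[page_id] += 1
        if (self.access_counts[page_id] >= self.hot_threshold
                and len(self.pages_in_host) < self.host_capacity):
            self.pages_in_host.add(page_id)       # migrate page to host DRAM
            return "promoted"
        return "cxl_ssd"                          # served from the CXL-SSD

promoter = HotPagePromoter()
for _ in range(10):
    promoter.on_access(page_id=42)
print(42 in promoter.pages_in_host)  # True: page 42 became hot and was promoted
```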
Near-Memory Processing
AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing
SJTU
UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures [Code]
THU & HKUST & PKU
Bandwidth Partitioning
Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
PKU
Deep Learning Accelerator
LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator [arXiv]
MSRA & NTU
Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization.
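A sketch of generic LUT-based, product-quantization-style inference, not the paper's exact design: activation sub-vectors are matched to centroids whose dot products with the weight slices are precomputed into a table, so inference replaces multiplications with lookups and additions. The centroids here are random for brevity, whereas the paper learns its tables via multistage training.

```python
import numpy as np

def build_lut(weights, centroids):
    """Precompute partial dot products between centroids and weight slices.

    weights:   (d_in, d_out)
    centroids: (n_sub, k, sub_dim) codebook per activation sub-vector
    Returns a LUT of shape (n_sub, k, d_out).
    """
    n_sub, k, sub_dim = centroids.shape
    w = weights.reshape(n_sub, sub_dim, -1)           # (n_sub, sub_dim, d_out)
    return np.einsum("skd,sdo->sko", centroids, w)

def lut_matvec(x, centroids, lut):
    """Approximate x @ W using table lookups instead of multiplications."""
    n_sub, k, sub_dim = centroids.shape
    xs = x.reshape(n_sub, sub_dim)
    # Nearest centroid index for each activation sub-vector
    idx = np.argmin(((xs[:, None, :] - centroids) ** 2).sum(-1), axis=1)
    # Sum the precomputed partial products selected by the indices
    return lut[np.arange(n_sub), idx].sum(axis=0)

rng = np.random.default_rng(2)
d_in, d_out, sub_dim, k = 64, 32, 4, 16
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
centroids = rng.standard_normal((d_in // sub_dim, k, sub_dim)).astype(np.float32)
lut = build_lut(W, centroids)

x = rng.standard_normal(d_in).astype(np.float32)
print(np.abs(lut_matvec(x, centroids, lut) - x @ W).mean())  # approximation error
```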
FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
POSTECH & NAVER
Image Signal Processor
IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline
Universitat Politècnica de Catalunya
Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time (see the sketch below).
Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).
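A software sketch of the mixed-resolution idea (the paper realizes this inside the ISP in real time): keep a region of interest at full resolution and block-average the rest, so downstream vision stages move far fewer pixels. The ROI and downscale factor are illustrative.

```python
import numpy as np

def mixed_resolution(frame, roi, downscale=4):
    """Keep `roi` at full resolution, downsample everything else.

    frame: (H, W) grayscale image
    roi:   (top, left, height, width) region kept at full resolution
    Returns a coarse background plus the full-resolution ROI patch, which
    together hold far fewer pixels than the original frame.
    """
    top, left, h, w = roi
    H, W = frame.shape
    # Coarse background: simple block-average downsampling
    background = frame[:H - H % downscale, :W - W % downscale]
    background = background.reshape(H // downscale, downscale,
                                    W // downscale, downscale).mean(axis=(1, 3))
    roi_patch = frame[top:top + h, left:left + w].copy()
    return background, roi_patch

frame = np.random.randint(0, 256, size=(480, 640)).astype(np.float32)
bg, patch = mixed_resolution(frame, roi=(200, 300, 64, 64))
print(bg.shape, patch.shape)                 # (120, 160) background + (64, 64) ROI
print((bg.size + patch.size) / frame.size)   # fraction of original pixels kept
```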
Microservice
Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility
University of Macau
Profile individual microservice latency in relation to environmental conditions.
Dynamically select the optimal set of microservices for scaling.
An end-to-end latency predictor serves as a simulator to obtain real-time feedback.
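A minimal sketch of the scaling decision loop, with a toy latency predictor and invented service profiles standing in for Grad's learned predictor: greedily add a replica to whichever microservice reduces the predicted end-to-end latency the most, until the SLO is met or the replica budget runs out.

```python
# Hypothetical per-service profile: base latency (ms) at one replica. The toy
# predictor assumes latency shrinks with replica count and that end-to-end
# latency is the sum along the critical path.
SERVICES = {"frontend": 20.0, "cart": 35.0, "checkout": 50.0, "payment": 25.0}

def predict_e2e_latency(replicas):
    """Stand-in for the end-to-end latency predictor."""
    return sum(base / replicas[name] for name, base in SERVICES.items())

def plan_scaling(slo_ms, budget_replicas):
    replicas = {name: 1 for name in SERVICES}
    used = 0
    while predict_e2e_latency(replicas) > slo_ms and used < budget_replicas:
        # Greedily scale the service whose extra replica helps latency most.
        best = max(
            SERVICES,
            key=lambda s: predict_e2e_latency(replicas)
            - predict_e2e_latency({**replicas, s: replicas[s] + 1}),
        )
        replicas[best] += 1
        used += 1
    return replicas, predict_e2e_latency(replicas)

plan, latency = plan_scaling(slo_ms=80.0, budget_replicas=6)
print(plan, round(latency, 1))  # scales checkout, cart, payment; ~75 ms predicted
```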
Acronyms
ML: Machine Learning
DKM: Differentiable KMeans Clustering
CXL: Compute Express Link
CSD: Computational Storage Drive
LUT: Look-Up Table
ISP: Image Signal Processor
SLAM: Simultaneous Localization and Mapping