HPCA 2025
Homepage:
Paper list:
Acceptance rate: 21% (= 112 / 534)
LLM Compression
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models []
Apple
Compress LLMs to fit into storage-limited devices.
Propose a memory-efficient DKM (Differentiable KMeans Clustering) implementation.
Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.
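A minimal sketch of the differentiable k-means (DKM) step that eDKM makes memory-efficient, assuming soft assignments via a softmax over negative weight-centroid distances; eDKM's actual memory optimizations (cross-device tensor marshaling, uniquing, and sharding of the attention map) are not shown.

```python
# Hedged sketch of one DKM iteration: soft-assign weights to centroids,
# then update centroids with a differentiable weighted average so that
# gradients flow back into the weights during fine-tuning.
import torch

def dkm_step(weights: torch.Tensor, centroids: torch.Tensor, tau: float = 1e-2):
    # Pairwise distances between each weight (N,) and each centroid (K,).
    dist = (weights.unsqueeze(1) - centroids.unsqueeze(0)).abs()   # (N, K)
    attn = torch.softmax(-dist / tau, dim=1)                       # soft assignment (N, K)
    new_centroids = (attn * weights.unsqueeze(1)).sum(0) / attn.sum(0).clamp_min(1e-12)
    soft_weights = attn @ new_centroids                            # clustered view of weights
    return soft_weights, new_centroids

w = torch.randn(4096, requires_grad=True)               # one flattened weight tensor
c = torch.linspace(w.min().item(), w.max().item(), 16)  # 16 clusters -> 4-bit weights
w_q, c = dkm_step(w, c)
w_q.sum().backward()                                    # gradients reach the original weights
```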
LLM Quantization
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [] []
Cornell & MSR & ICL
Algorithm: Fine-grained data type adaptation that quantizes each group of weights (e.g., 128 weights per group) with its own numerical data type.
Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.
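A hedged sketch of the per-group data type adaptation idea: quantize each group of 128 weights with every candidate low-bit data type (represented here as a small grid of representable values) and keep the type with the lowest reconstruction error. The candidate grids below are illustrative placeholders, not BitMoD's exact data types.

```python
import numpy as np

CANDIDATES = {
    "int3":    np.array([-4, -3, -2, -1, 0, 1, 2, 3], dtype=np.float32),
    "fp3-ish": np.array([-8, -4, -2, -1, 1, 2, 4, 8], dtype=np.float32),  # fp-like grid
}

def quantize_group(group: np.ndarray):
    best = None
    for name, grid in CANDIDATES.items():
        scale = (np.abs(group).max() + 1e-12) / np.abs(grid).max()  # per-group scale
        # Round each weight to the nearest representable value of this type.
        idx = np.abs(group[:, None] / scale - grid[None, :]).argmin(axis=1)
        deq = grid[idx] * scale
        err = np.mean((group - deq) ** 2)
        if best is None or err < best[0]:
            best = (err, name, deq)
    return best  # (mse, chosen data type, dequantized group)

weights = np.random.randn(128).astype(np.float32)
mse, dtype, deq = quantize_group(weights)
print(dtype, mse)
```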
MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type []
SJTU
Assign the appropriate data type for each group adaptively.
Propose an efficient real-time quantization mechanism.
Implement a dedicated processing element that efficiently supports MANT and incorporates a real-time quantization unit.
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format []
NJU & MICAS KU Leuven
Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.
Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.
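A minimal sketch of an Anda-style format, assuming a simplified block-floating-point stand-in: each group of activations shares the exponent of its largest value, while the mantissa width can differ per group (more bits for more sensitive modules). This is not the exact Anda bit layout.

```python
import numpy as np

def encode_group(x: np.ndarray, mantissa_bits: int):
    shared_exp = int(np.ceil(np.log2(np.abs(x).max() + 1e-38)))  # group-shared exponent
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    q = np.clip(np.round(x / scale),
                -(2 ** (mantissa_bits - 1)),
                2 ** (mantissa_bits - 1) - 1)                    # signed mantissas
    return q.astype(np.int32), shared_exp

def decode_group(q: np.ndarray, shared_exp: int, mantissa_bits: int):
    return q * 2.0 ** (shared_exp - (mantissa_bits - 1))

acts = np.random.randn(32).astype(np.float32)
q, e = encode_group(acts, mantissa_bits=4)   # a sensitive layer might get more bits
print(np.abs(acts - decode_group(q, e, 4)).max())
```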
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
SJTU
Energy-Efficient LLM Inference
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency []
UIUC & Microsoft Azure
Given the current load and available resources, select the energy-optimized configuration (e.g., different model parallelisms, GPU frequencies).
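A hedged sketch of this configuration selection: among profiled (parallelism, GPU frequency) configurations that still meet the latency SLO at the current load, pick the one with the lowest predicted energy. The profile table here is a made-up placeholder, not DynamoLLM's measured data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    tensor_parallel: int   # number of GPUs sharding the model
    gpu_freq_mhz: int      # clock frequency cap

# Offline-profiled (latency_ms, energy_j) per config for one load bucket.
PROFILE = {
    Config(2, 1980): (950, 410.0),
    Config(2, 1410): (1300, 300.0),
    Config(4, 1410): (700, 520.0),
    Config(4, 990):  (1100, 380.0),
}

def pick_config(slo_ms: float) -> Config:
    feasible = [(e, cfg) for cfg, (lat, e) in PROFILE.items() if lat <= slo_ms]
    if not feasible:  # no config meets the SLO: fall back to the fastest one
        return max(PROFILE, key=lambda c: (c.tensor_parallel, c.gpu_freq_mhz))
    return min(feasible, key=lambda t: t[0])[1]  # lowest-energy feasible config

print(pick_config(slo_ms=1200))  # -> Config(tensor_parallel=4, gpu_freq_mhz=990)
```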
throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
National Technical University of Athens
Long-Context LLM Inference
InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference []
PKU
Offload decode-phase attention over the KV cache to Computational Storage Drives (CSDs).
Hardware-Assisted LLM Inference
LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
ICT, CAS
PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
Samsung SDS
FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
Seoul National University
Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
THU
Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM []
ICT, CAS
Hermes
KAIST
Inter-iteration sparsity → an FFN-Reuse algorithm to identify and skip redundant computations in FFN layers across different iterations (see the sketch after this list)
Intra-iteration sparsity → a modified eager prediction method to accurately predict attention scores, skipping unnecessary computations within an iteration
A dedicated hardware architecture to support the sparsity-inducing algorithms.
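A toy sketch of the inter-iteration FFN-reuse idea from the first bullet, assuming a simple per-token input-delta threshold as the reuse criterion (the paper's actual redundancy-detection mechanism may differ):

```python
import torch

class ReuseFFN(torch.nn.Module):
    def __init__(self, dim: int, hidden: int, tol: float = 1e-3):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.GELU(), torch.nn.Linear(hidden, dim))
        self.tol = tol
        self.prev_in = None   # inputs from the previous iteration
        self.prev_out = None  # cached outputs from the previous iteration

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        if self.prev_in is None or self.prev_in.shape != x.shape:
            out = self.net(x)                              # first iteration: full compute
        else:
            changed = (x - self.prev_in).abs().amax(dim=1) > self.tol  # per-token delta
            out = self.prev_out.clone()
            if changed.any():
                out[changed] = self.net(x[changed])        # recompute only changed tokens
        self.prev_in, self.prev_out = x, out
        return out
```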
Yonsei University
Ditto: a difference processing algorithm
Leverage temporal value similarity across time steps, combined with quantization.
Perform full bit-width operations for the initial time step and process subsequent steps with temporal differences.
Design the Ditto hardware → a specialized accelerator that exploits the difference processing algorithm (see the sketch below)
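A hedged sketch of the difference-processing idea: run the first time step at full bit-width, then update cached outputs with the quantized delta between consecutive time steps. For a linear layer, W·x_t = W·x_{t-1} + W·(x_t − x_{t-1}), so the cached previous output plus a narrow, low-bit delta product reproduces the full result. The 8-bit delta quantization below is illustrative.

```python
import numpy as np

class DiffLinear:
    def __init__(self, w: np.ndarray):
        self.w, self.prev_x, self.prev_y = w, None, None

    def __call__(self, x: np.ndarray) -> np.ndarray:
        if self.prev_x is None:
            y = self.w @ x                       # initial step: full bit-width
        else:
            delta = x - self.prev_x              # temporally similar -> small range
            s = np.abs(delta).max() + 1e-12
            delta_q = np.round(delta * 127 / s) * s / 127   # 8-bit-style delta
            y = self.prev_y + self.w @ delta_q   # cheap update of the cached output
        self.prev_x, self.prev_y = x, y
        return y

layer = DiffLinear(np.random.randn(64, 64))
x0 = np.random.randn(64)
y0 = layer(x0)                                   # time step 0: exact
y1 = layer(x0 + 0.01 * np.random.randn(64))      # time step 1: approximated via delta
```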
Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
UC Merced & Meta
Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
THU
Alibaba
Meta
The Importance of Generalizability in Machine Learning for Systems
MIT & Google
Pittsburgh & NVIDIA & Ghent
TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology
KAIST
PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
KAIST & Northeastern University & Boston University
HKUST-GZ & Intel & UCSD & HKUST
EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer
FDU
UIUC
Opportunistic context switches upon the detection of long access delays.
Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.
Adaptive page migration to promote hot pages in CXL-SSD to the host.
Implemented with a CXL-SSD simulator.
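A toy sketch of the adaptive page-migration policy in the bullets above: count accesses per CXL-SSD page and promote pages whose access count crosses a threshold into host DRAM. The threshold and eviction rule are simplified placeholders, not the paper's tuned policy.

```python
from collections import Counter

PROMOTE_THRESHOLD = 8
HOST_CACHE_PAGES = 1024

access_counts: Counter[int] = Counter()
host_resident: set[int] = set()

def on_page_access(page: int) -> str:
    """Return where the access is served from, promoting hot pages."""
    if page in host_resident:
        return "host-dram"
    access_counts[page] += 1
    if access_counts[page] >= PROMOTE_THRESHOLD:
        if len(host_resident) >= HOST_CACHE_PAGES:           # evict the coldest page
            victim = min(host_resident, key=lambda p: access_counts[p])
            host_resident.discard(victim)
        host_resident.add(page)                               # promote hot page to host
        return "cxl-ssd (promoted)"
    return "cxl-ssd"
```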
AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing
SJTU
THU & HKUST & PKU
Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
PKU
MSRA & NTU
Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization.
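A toy sketch of LUT-based inference in the spirit of this bullet: inputs are matched to a small codebook of centroids per subspace, and the dot products between every centroid and every weight column are precomputed into a look-up table, so inference becomes table lookups plus additions instead of multiplications. The codebook here is random; the multistage training that learns it is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, OUT, SUB, K = 64, 32, 8, 16          # input dim, output dim, subvector len, codebook size

W = rng.standard_normal((D, OUT)).astype(np.float32)
codebook = rng.standard_normal((D // SUB, K, SUB)).astype(np.float32)  # per-subspace centroids

# Precompute LUT[s, k, o] = centroid_k(subspace s) . W[subspace s, o]
LUT = np.einsum("skc,sco->sko", codebook, W.reshape(D // SUB, SUB, OUT))

def lut_matvec(x: np.ndarray) -> np.ndarray:
    xs = x.reshape(D // SUB, SUB)
    # Nearest centroid per subvector = the low-bit "code" stored at runtime.
    codes = ((xs[:, None, :] - codebook) ** 2).sum(-1).argmin(1)       # (D//SUB,)
    return LUT[np.arange(D // SUB), codes].sum(0)                      # lookups + adds

x = rng.standard_normal(D).astype(np.float32)
print(np.abs(lut_matvec(x) - x @ W).mean())  # approximation error vs. exact matmul
```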
FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
POSTECH & NAVER
IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline
Universitat Politècnica de Catalunya
Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time.
Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).
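A toy sketch of the mixed-resolution idea: keep regions of interest at full resolution and store everything else subsampled, shrinking the data handed to the vision pipeline. ROI selection here is a placeholder for the actual ISP/software cooperation.

```python
import numpy as np

def mixed_resolution(frame: np.ndarray, roi: tuple[slice, slice], factor: int = 4):
    """Return a compressed representation: low-res background + full-res ROI."""
    background = frame[::factor, ::factor].copy()   # subsampled background
    patch = frame[roi].copy()                       # full-resolution region of interest
    return background, patch

frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
bg, patch = mixed_resolution(frame, (slice(400, 700), slice(800, 1200)))
print(bg.nbytes + patch.nbytes, "bytes vs", frame.nbytes)  # ~8x smaller here
```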
Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility
University of Macau
Profile individual microservice latency in relation to environmental conditions.
Dynamically select the optimal set of microservices for scaling.
An end-to-end latency predictor serves as a simulator to obtain real-time feedback.
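A hedged sketch of predictor-guided scaling in this spirit: use the end-to-end latency predictor as a simulator and greedily add a replica to whichever microservice yields the largest predicted improvement, until the SLO is met. `predict_latency` is a made-up stand-in for the learned predictor, not Grad's model.

```python
def predict_latency(replicas: dict[str, int]) -> float:
    # Placeholder predictor: latency falls as bottleneck services gain replicas.
    base = {"gateway": 40.0, "search": 120.0, "rank": 90.0}
    return sum(ms / replicas[svc] for svc, ms in base.items())

def plan_scaling(replicas: dict[str, int], slo_ms: float, budget: int) -> dict[str, int]:
    plan = dict(replicas)
    for _ in range(budget):
        if predict_latency(plan) <= slo_ms:
            break
        # Try one extra replica on each service; keep the best "what-if" outcome.
        best = min(plan, key=lambda s: predict_latency({**plan, s: plan[s] + 1}))
        plan[best] += 1
    return plan

print(plan_scaling({"gateway": 1, "search": 1, "rank": 1}, slo_ms=120.0, budget=5))
```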
ML: Machine Learning
DKM: Differentiable KMeans Clustering
CXL: Compute Express Link
CSD: Computational Storage Drive
LUT: Look-Up Table
ISP: Image Signal Processor
SLAM: Simultaneous Localization and Mapping
EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models []
Ditto: Accelerating Diffusion Model via Temporal Value Similarity []
Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization []
Revisiting Reliability in Large-Scale Machine Learning Research Clusters []
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI []
OASIS: Object-Aware Page Management for Multi-GPU Systems []
Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck []
SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design []
UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures []
LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator []