# HPCA 2025

## Meta Info

Homepage: <https://hpca-conf.org/2025/>

Paper list: <https://hpca-conf.org/2025/main-program/>

### Acceptance Rate

21% (= 112 / 534)

## Papers

### Large Language Models (LLMs)

* LLM Compression
  * eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models \[[arXiv](https://arxiv.org/abs/2309.00964)]
    * Apple
    * Compress LLMs to fit into storage-limited devices.
    * Propose a memory-efficient implementation of DKM (Differentiable KMeans Clustering); the soft-clustering idea is sketched after this list.
    * Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.
* LLM Quantization
  * BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration \[[arXiv](https://arxiv.org/abs/2411.11745)] \[[Code](https://github.com/yc2367/BitMoD-HPCA-25)]
    * Cornell & MSR & ICL
    * Algorithm: fine-grained data type adaptation that quantizes each group of weights (e.g., 128 weights per group) with its own numerical data type; see the datatype-selection sketch after this list.
    * Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.
  * MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type \[[arXiv](https://arxiv.org/abs/2502.18755)]
    * SJTU
    * Adaptively assign an appropriate data type to each weight group (see the datatype-selection sketch after this list).
    * Propose an efficient real-time quantization mechanism.
    * Implement a specific processing element to efficiently support MANT and incorporate a real-time quantization unit.
  * Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format \[[arXiv](https://arxiv.org/abs/2411.15982)]
    * NJU & MICAS KU Leuven
    * Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.
    * Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation; see the shared-exponent sketch after this list.
  * VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
    * SJTU
* Energy-Efficient LLM Inference
  * DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency \[[arXiv](https://arxiv.org/abs/2408.00741)]
    * UIUC & Microsoft Azure
    * Given the current load and available resources, select the energy-optimized configuration (e.g., model parallelism, GPU frequency); see the configuration-selection sketch after this list.
  * throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
    * National Technical University of Athens
* Long-Context LLM Inference
  * InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference \[[arXiv](https://arxiv.org/abs/2409.04992)]
    * PKU
    * Offload decoding-phase attention over the KV cache to Computational Storage Drives (CSDs).
* Hardware-Assisted LLM Inference
  * LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
    * ICT, CAS
  * PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
    * Samsung SDS
  * FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
    * Seoul National University
  * Lincoln: Real-Time 50\~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
    * THU
  * Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM \[[arXiv](https://arxiv.org/abs/2502.16963)]
    * ICT, CAS
    * **Hermes**
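
The core mechanism behind eDKM can be illustrated with a minimal sketch of one differentiable k-means (DKM) step, assuming a PyTorch-style implementation: weights are softly assigned to centroids via a softmax over distances, so clustering stays differentiable end to end. The temperature, shapes, and iteration count are illustrative; eDKM's actual contribution (making this step memory-efficient during LLM fine-tuning) is not shown.

```python
import torch

def dkm_soft_cluster(weights, centroids, temperature=1e-2):
    """One DKM step: soft-assign every weight to every centroid, then
    update centroids as the attention-weighted mean of the weights.

    weights:   (N,) flattened layer weights
    centroids: (K,) current cluster centers
    """
    # Distance between every weight and every centroid -> (N, K)
    dist = (weights.unsqueeze(1) - centroids.unsqueeze(0)).abs()
    # Differentiable (soft) assignment: softmax over negative distances
    attn = torch.softmax(-dist / temperature, dim=1)                # (N, K)
    # Centroid update: attention-weighted average of the weights
    new_centroids = (attn * weights.unsqueeze(1)).sum(dim=0) / attn.sum(dim=0)
    # Soft-clustered weights used in the forward pass (gradients flow through)
    clustered = attn @ new_centroids                                # (N,)
    return clustered, new_centroids

# Toy usage: cluster 1024 weights onto 16 shared values (~4-bit clustering)
w = torch.randn(1024)
c = torch.linspace(float(w.min()), float(w.max()), 16)
for _ in range(10):                       # iterate toward convergence
    w_hat, c = dkm_soft_cluster(w, c)
```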
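
The group-adaptive datatype idea of BitMoD (and MANT) can be sketched as: quantize each weight group with every candidate datatype and keep the one with the lowest quantization error. The two 4-bit value grids and the group size below are illustrative assumptions, not the papers' actual datatypes.

```python
import numpy as np

# Representable values for two illustrative 4-bit datatypes
INT4 = np.arange(-8, 8, dtype=np.float32)                         # uniform grid
FP4 = np.array([-12., -8., -6., -4., -3., -2., -1., 0.,
                1., 2., 3., 4., 6., 8., 12.], dtype=np.float32)   # non-uniform grid

def quantize_group(group, grid):
    """Quantize one weight group onto `grid`, scaled to the group's range."""
    scale = np.abs(group).max() / np.abs(grid).max() + 1e-12
    levels = grid * scale
    idx = np.argmin(np.abs(group[:, None] - levels[None, :]), axis=1)
    deq = levels[idx]                               # dequantized weights
    return deq, np.square(group - deq).sum()        # (values, error)

def mixed_datatype_quantize(weights, group_size=128):
    """Per group, keep whichever datatype gives the lower quantization error."""
    out = np.empty_like(weights)
    for g in range(0, len(weights), group_size):
        group = weights[g:g + group_size]
        deq, _ = min((quantize_group(group, grid) for grid in (INT4, FP4)),
                     key=lambda t: t[1])
        out[g:g + group_size] = deq
    return out

w = np.random.randn(1024).astype(np.float32)
w_q = mixed_datatype_quantize(w)
```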
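
Anda's group-shared exponent with dynamic mantissa allocation is close in spirit to block floating point. Below is a minimal sketch, assuming a simple allocation rule (fewest mantissa bits that keep the group's error within a tolerance); the group size, tolerance, and rule are assumptions, not Anda's actual policy.

```python
import numpy as np

def encode_group_shared_exponent(acts, group_size=32, max_mantissa_bits=7, tol=0.05):
    """Quantize activations with one shared exponent per group and the fewest
    mantissa bits that keep the group's max error within `tol`."""
    out = np.empty_like(acts)
    for g in range(0, len(acts), group_size):
        group = acts[g:g + group_size]
        # Shared exponent: smallest power of two covering the group's range
        shared_exp = int(np.ceil(np.log2(np.abs(group).max() + 1e-30)))
        for mbits in range(1, max_mantissa_bits + 1):      # fewest bits first
            step = 2.0 ** (shared_exp - mbits)             # quantization step
            deq = np.round(group / step) * step
            if np.abs(deq - group).max() <= tol:           # accurate enough?
                break
        out[g:g + group_size] = deq
    return out

x = np.random.randn(1024).astype(np.float32)
x_q = encode_group_shared_exponent(x)
```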
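
DynamoLLM's configuration selection reduces to a constrained search over profiled operating points: pick the lowest-energy configuration whose throughput covers the current load. A minimal sketch; the profile table and all of its numbers are made up for illustration.

```python
# Each entry: (tensor_parallel_degree, gpu_freq_mhz, tokens_per_sec, joules_per_token)
PROFILE = [
    (1, 1980, 1200, 0.9),
    (2, 1410, 1500, 0.7),
    (2, 1980, 2100, 1.1),
    (4, 1410, 3000, 1.3),
]

def pick_config(load_tps):
    """Return the lowest-energy config whose throughput covers the load."""
    feasible = [c for c in PROFILE if c[2] >= load_tps]
    if not feasible:
        return max(PROFILE, key=lambda c: c[2])   # fall back to the fastest
    return min(feasible, key=lambda c: c[3])      # minimize energy per token

print(pick_config(1400))   # -> (2, 1410, 1500, 0.7)
```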

### Diffusion Models

* EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models \[[arXiv](https://arxiv.org/abs/2501.05680)]
  * KAIST
  * Inter-iteration sparsity → an FFN-Reuse algorithm that identifies and skips redundant computations in FFN layers across iterations (see the FFN-reuse sketch after this list).
  * Intra-iteration sparsity → a modified eager prediction method that accurately predicts attention scores, skipping unnecessary computations within an iteration.
  * A dedicated hardware architecture to support the sparsity-inducing algorithms.
* Ditto: Accelerating Diffusion Model via Temporal Value Similarity \[[arXiv](https://arxiv.org/abs/2501.11211)]
  * Yonsei University
  * Ditto: a difference processing algorithm
    * Leverage temporal similarity with quantization.
    * Perform full bit-width operations for the initial time step and process subsequent steps with temporal differences.
  * Ditto hardware: a specialized accelerator that exploits the difference processing algorithm (see the delta-quantization sketch after this list).
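
EXION's inter-iteration FFN-Reuse builds on the observation that many post-activation FFN outputs stay zero across adjacent diffusion iterations. The sketch below is a simplified assumption of that idea (skip output columns that were zero in the previous iteration), not EXION's exact algorithm.

```python
import numpy as np

def ffn_forward(x, W, prev_zero_mask=None):
    """ReLU FFN layer; skip output columns that were zero last iteration."""
    if prev_zero_mask is None:
        y = np.maximum(x @ W, 0.0)                       # full computation
    else:
        y = np.zeros((x.shape[0], W.shape[1]), dtype=x.dtype)
        active = ~prev_zero_mask                         # columns to recompute
        y[:, active] = np.maximum(x @ W[:, active], 0.0)
    return y, (y == 0.0).all(axis=0)                     # per-column zero mask

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
W = rng.standard_normal((64, 256)).astype(np.float32)
mask = None
for step in range(50):                                   # diffusion iterations
    y, mask = ffn_forward(x, W, mask)
    # Next iteration's input drifts only slightly (temporal similarity)
    x = x + 0.01 * rng.standard_normal(x.shape).astype(np.float32)
```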
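
Ditto's difference processing exploits temporal value similarity: the first step runs at full bit-width, and later steps process only quantized step-to-step deltas. A minimal sketch; the 8-bit delta format and the simulated activation drift are illustrative assumptions.

```python
import numpy as np

def quantize(x, bits=8):
    """Symmetric uniform quantization to `bits` bits."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
acts = rng.standard_normal(1024).astype(np.float32)
prev = acts.copy()                            # step 0: full bit-width
for step in range(1, 50):
    # Simulated next-step activations: close to the previous step's
    acts = acts + 0.02 * rng.standard_normal(1024).astype(np.float32)
    delta = quantize(acts - prev)             # small deltas -> narrow arithmetic
    prev = prev + delta                       # running reconstruction
```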

### Deep Learning Recommendation Models (DLRMs)

* Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
  * UC Merced & Meta

### Dynamic Neural Networks

* Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
  * THU

### ML Cluster Reliability

* Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization \[[arXiv](https://arxiv.org/abs/2406.04594)]
  * Alibaba
* Revisiting Reliability in Large-Scale Machine Learning Research Clusters \[[arXiv](https://arxiv.org/abs/2410.21680)]
  * Meta

### ML for Systems

* The Importance of Generalizability in Machine Learning for Systems
  * MIT & Google

### ML Benchmark

* MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI \[[arXiv](https://arxiv.org/abs/2410.12032)]

### Multi-GPU Systems

* OASIS: Object-Aware Page Management for Multi-GPU Systems \[[Paper](https://users.elis.ugent.be/~leeckhou/papers/HPCA2025-OASIS.pdf)]
  * Pittsburgh & NVIDIA & Ghent

### Collective Communication

* TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology
  * KAIST
* PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
  * KAIST & Northeastern University & Boston University

### Interconnect

* Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck \[[Code](https://zenodo.org/records/14355343)]
  * HKUST-GZ & Intel & UCSD & HKUST
* EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer
  * FDU

### Compute Express Link (CXL)

* SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design \[[arXiv](https://arxiv.org/abs/2501.10682)]
  * UIUC
  * Opportunistic context switches upon the detection of long access delays.
  * Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.
  * Adaptive page migration to promote hot pages in CXL-SSD to the host.
  * Implemented and evaluated in a CXL-SSD simulator.

### Near-Memory Processing

* AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing
  * SJTU
* UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures \[[Code](https://github.com/UniNDP-hpca25-ae/UniNDP)]
  * THU & HKUST & PKU

### Bandwidth Partitioning

* Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
  * PKU

### Deep Learning Accelerator

* LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator \[[arXiv](https://arxiv.org/abs/2501.10658)]
  * MSRA & NTU
  * Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization; see the lookup-based matmul sketch after this list.
* FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
  * POSTECH & NAVER
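
LUT-based acceleration in the spirit of LUT-DLA replaces multiply-accumulates with table lookups: weight subvectors are mapped offline to centroid indices, and each matrix-vector product becomes building a small dot-product table plus lookups and accumulation. The shapes, centroid count, and random "trained" indices below are illustrative assumptions.

```python
import numpy as np

D, SUB, K, N = 64, 8, 16, 32          # input dim, subvector len, centroids, outputs
S = D // SUB                          # number of subvectors
rng = np.random.default_rng(0)
centroids = rng.standard_normal((S, K, SUB)).astype(np.float32)
w_idx = rng.integers(0, K, size=(S, N))      # each weight subvector -> 4-bit index

def lut_matvec(x):
    """y[n] = sum_s <x_s, centroid[s, w_idx[s, n]]>, computed via lookups."""
    xs = x.reshape(S, SUB)
    # Lookup table: dot products of each input subvector with all centroids
    lut = np.einsum('sd,skd->sk', xs, centroids)            # (S, K)
    # Inference is then just table lookups and accumulation (no multiplies)
    return lut[np.arange(S)[:, None], w_idx].sum(axis=0)    # (N,)

y = lut_matvec(rng.standard_normal(D).astype(np.float32))
```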

### Image Signal Processor

* IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline
  * Universitat Politècnica de Catalunya
  * Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time.
  * Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).

### Microservice

* Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility
  * University of Macau
  * Profile individual microservice latency in relation to environmental conditions.
  * Dynamically select the optimal set of microservices for scaling.
  * An end-to-end latency predictor serves as a simulator to obtain real-time feedback.

## Acronyms

* ML: Machine Learning
* LLM: Large Language Model
* DKM: Differentiable KMeans Clustering
* DLRM: Deep Learning Recommendation Model
* DNN: Deep Neural Network
* FFN: Feed-Forward Network
* PIM: Processing-In-Memory
* NDP: Near-Data Processing
* CXL: Compute Express Link
* CSD: Computational Storage Drive
* LUT: Look-Up Table
* ISP: Image Signal Processor
* SLAM: Simultaneous Localization and Mapping
