# HPCA 2025

## Meta Info

Homepage: <https://hpca-conf.org/2025/>

Paper list: <https://hpca-conf.org/2025/main-program/>

### Acceptance Rate

21% (= 112 / 534)

## Papers

### Large Language Models (LLMs)

* LLM Compression
  * eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models \[[arXiv](https://arxiv.org/abs/2309.00964)]
    * Apple
    * Compress LLMs to fit into storage-limited devices.
    * Propose a memory-efficient DKM (Differentiable KMeans Clustering) implementation (a minimal sketch of the DKM idea follows this list).
    * Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.
* LLM Quantization
  * BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration \[[arXiv](https://arxiv.org/abs/2411.11745)] \[[Code](https://github.com/yc2367/BitMoD-HPCA-25)]
    * Cornell & MSR & ICL
    * Algorithm: fine-grained data type adaptation that quantizes each group of (e.g., 128) weights with its own numerical data type (a sketch of this group-adaptive idea, shared with MANT below, follows this list).
    * Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.
  * MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type \[[arXiv](https://arxiv.org/abs/2502.18755)]
    * SJTU
    * Assign the appropriate data type for each group adaptively.
    * Propose an efficient real-time quantization mechanism.
    * Implement a specific processing element to efficiently support MANT and incorporate a real-time quantization unit.
  * Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format \[[arXiv](https://arxiv.org/abs/2411.15982)]
    * NJU & MICAS KU Leuven
    * Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.
    * Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation (a block-floating-point sketch follows this list).
  * VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
    * SJTU
* Energy-Efficient LLM Inference
  * DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency \[[arXiv](https://arxiv.org/abs/2408.00741)]
    * UIUC & Microsoft Azure
    * Given the current load and available resources, select the energy-optimized configuration (e.g., different model parallelisms, GPU frequencies).
  * throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
    * National Technical University of Athens
* Long-Context LLM Inference
  * InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference \[[arXiv](https://arxiv.org/abs/2409.04992)]
    * PKU
    * Offload decode-phase attention over the KV cache to Computational Storage Drives (CSDs).
* Hardware-Assisted LLM Inference
  * LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
    * ICT, CAS
  * PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
    * Samsung SDS
  * FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
    * Seoul National University
  * Lincoln: Real-Time 50\~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
    * THU
  * Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM \[[arXiv](https://arxiv.org/abs/2502.16963)]
    * ICT, CAS
    * **Hermes**
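
Below is a minimal PyTorch sketch of the DKM idea referenced in the eDKM entry above (these notes' own illustration, not code from the paper): each weight is softly assigned to every centroid, so clustering stays differentiable and can be trained jointly with the task loss. This naive formulation materializes the full (N, K) attention map, which is exactly the memory cost eDKM's implementation works to avoid.

```python
import torch

def dkm_step(weights, centroids, temperature=0.05):
    """One differentiable k-means (DKM) iteration over a flat weight tensor.

    Naive formulation: materializes the full (N, K) attention map,
    the memory bottleneck that eDKM's implementation avoids.
    """
    dist = (weights.unsqueeze(1) - centroids.unsqueeze(0)).abs()    # (N, K) distances
    attn = torch.softmax(-dist / temperature, dim=1)                # soft assignments
    centroids = (attn * weights.unsqueeze(1)).sum(0) / attn.sum(0)  # weighted centroid update
    w_hat = attn @ centroids   # soft-clustered weights; gradients flow through attn
    return w_hat, centroids

# Illustrative usage: cluster a layer's weights onto 16 shared values.
w_hat, c = dkm_step(torch.randn(4096), torch.linspace(-1.0, 1.0, 16))
```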
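
The group-adaptive datatype idea shared by BitMoD and MANT can be illustrated with a small NumPy sketch: quantize each group of (e.g., 128) weights with every candidate datatype and keep the one with the lowest error. The candidate grids below are illustrative stand-ins, not the actual datatypes from either paper.

```python
import numpy as np

# Illustrative 3-bit candidate grids (stand-ins for the papers' datatypes):
# a uniform integer grid and a non-uniform grid with wider dynamic range.
DATATYPES = {
    "int3": np.array([-4, -3, -2, -1, 0, 1, 2, 3], dtype=np.float32),
    "nonuniform3": np.array([-8, -4, -2, -1, 1, 2, 4, 8], dtype=np.float32),
}

def quantize_group(group, grid):
    """Scale the group to the grid's range, round each weight to the
    nearest representable value, and return the dequantized weights."""
    scale = np.abs(group).max() / np.abs(grid).max() + 1e-12
    nearest = np.abs(group[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[nearest] * scale

def pick_datatype(group):
    """Keep the datatype with the lowest mean-squared quantization error."""
    err = lambda name: ((quantize_group(group, DATATYPES[name]) - group) ** 2).mean()
    best = min(DATATYPES, key=err)
    return best, quantize_group(group, DATATYPES[best])

# Illustrative usage: quantize one 128-weight group.
name, w_q = pick_datatype(np.random.randn(128).astype(np.float32))
```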
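
Anda's format sits in the block-floating-point family: a group shares one exponent, and the mantissa width can vary per group (more bits for precision-sensitive modules, fewer elsewhere). Below is a rough NumPy sketch of encoding one group under an assumed signed-mantissa layout; the paper's exact bit layout and allocation policy differ.

```python
import numpy as np

def encode_group(group, mantissa_bits):
    """Encode a group with one shared power-of-two exponent and
    `mantissa_bits`-bit signed mantissas (block floating point)."""
    max_mant = 2 ** (mantissa_bits - 1) - 1
    shared_exp = int(np.ceil(np.log2(np.abs(group).max() / max_mant + 1e-30)))
    mantissas = np.round(group / 2.0 ** shared_exp)  # each fits in mantissa_bits bits
    return shared_exp, mantissas.astype(np.int32)

def decode_group(shared_exp, mantissas):
    """Dequantize: mantissa times the group's shared power-of-two exponent."""
    return mantissas.astype(np.float32) * 2.0 ** shared_exp

# Illustrative usage: a sensitive group might get 6 mantissa bits, others 4.
exp4, m4 = encode_group(np.random.randn(32).astype(np.float32), mantissa_bits=4)
x_hat = decode_group(exp4, m4)
```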

### Diffusion Models

* EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models \[[arXiv](https://arxiv.org/abs/2501.05680)]
  * KAIST
  * Inter-iteration sparsity → an FFN-Reuse algorithm that identifies and skips redundant computations in FFN layers across different iterations (a simplified reuse sketch follows this list).
  * Intra-iteration sparsity → a modified eager prediction method that accurately predicts attention scores, skipping unnecessary computations within an iteration.
  * A dedicated hardware architecture to support the sparsity-inducing algorithms.
* Ditto: Accelerating Diffusion Model via Temporal Value Similarity \[[arXiv](https://arxiv.org/abs/2501.11211)]
  * Yonsei University
  * Ditto: a difference processing algorithm (a rough sketch follows this list)
    * Leverage temporal value similarity across time steps, combined with quantization.
    * Perform full bit-width operations for the initial time step and process subsequent steps with quantized temporal differences.
  * Design the Ditto hardware, a specialized accelerator that exploits the difference processing.
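
The inter-iteration reuse in EXION can be sketched as caching FFN results and recomputing only where the input changed since the previous denoising iteration. This token-granularity NumPy version is a simplification of the paper's FFN-Reuse algorithm; the threshold and masking scheme are illustrative.

```python
import numpy as np

def ffn_with_reuse(ffn, x, cache, tau=1e-3):
    """Reuse cached FFN outputs across diffusion iterations, recomputing
    only tokens whose input moved by more than `tau` (simplified).

    x: (tokens, dim) input; ffn: callable mapping (n, dim) -> (n, dim).
    """
    if cache is None:                    # first iteration: compute everything
        y = ffn(x)
        return y, (x.copy(), y.copy())
    prev_x, prev_y = cache
    changed = np.abs(x - prev_x).max(axis=-1) > tau  # per-token change mask
    y = prev_y.copy()
    if changed.any():
        y[changed] = ffn(x[changed])     # recompute only the changed tokens
    return y, (x.copy(), y.copy())
```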
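
Ditto's difference processing is easiest to see for a purely linear operator, where f(x_prev + Δ) = f(x_prev) + f(Δ) holds exactly, so after the first step only the small quantized delta needs processing. A rough NumPy sketch under that linearity assumption:

```python
import numpy as np

def quantize(x, bits=4):
    """Uniform symmetric quantizer for the step-to-step deltas."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def linear_with_ditto(W, xs):
    """Full bit-width matmul at the first diffusion step, then low-bit
    deltas for later steps: f(x_t) ~ f(x_{t-1}) + f(q(x_t - x_{t-1}))."""
    x_prev = xs[0]
    y_prev = x_prev @ W                  # step 0 at full precision
    outs = [y_prev]
    for x in xs[1:]:
        delta = quantize(x - x_prev)     # temporally similar -> small delta
        y_prev = y_prev + delta @ W      # only the delta is processed
        x_prev = x_prev + delta
        outs.append(y_prev)
    return outs
```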

### Deep Learning Recommendation Models (DLRMs)

* Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
  * UC Merced & Meta

### Dynamic Neural Networks

* Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
  * THU

### ML Cluster Reliability

* Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization \[[arXiv](https://arxiv.org/abs/2406.04594)]
  * Alibaba
* Revisiting Reliability in Large-Scale Machine Learning Research Clusters \[[arXiv](https://arxiv.org/abs/2410.21680)]
  * Meta

### ML for Systems

* The Importance of Generalizability in Machine Learning for Systems
  * MIT & Google

### ML Benchmark

* MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI \[[arXiv](https://arxiv.org/abs/2410.12032)]

### Multi-GPU Systems

* OASIS: Object-Aware Page Management for Multi-GPU Systems \[[Paper](https://users.elis.ugent.be/~leeckhou/papers/HPCA2025-OASIS.pdf)]
  * Pittsburgh & NVIDIA & Ghent

### Collective Communication

* TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology
  * KAIST
* PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
  * KAIST & Northeastern University & Boston University

### Interconnect

* Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck \[[Code](https://zenodo.org/records/14355343)]
  * HKUST-GZ & Intel & UCSD & HKUST
* EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer
  * FDU

### Compute Express Link (CXL)

* SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design \[[arXiv](https://arxiv.org/abs/2501.10682)]
  * UIUC
  * Opportunistic context switches upon the detection of long access delays.
  * Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.
  * Adaptive page migration to promote hot pages in CXL-SSD to the host.
    * Implemented and evaluated on a CXL-SSD simulator.

### Near-Memory Processing

* AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing
  * SJTU
* UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures \[[Code](https://github.com/UniNDP-hpca25-ae/UniNDP)]
  * THU & HKUST & PKU

### Bandwidth Partitioning

* Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
  * PKU

### Deep Learning Accelerator

* LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator \[[arXiv](https://arxiv.org/abs/2501.10658)]
  * MSRA & NTU
  * Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization (a toy LUT-GEMM sketch follows this list).
* FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
  * POSTECH & NAVER
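
Both entries above exploit the same observation: with extreme low-bit weights, a dot product can be replaced by lookups into small tables of precomputed partial sums. Below is a toy NumPy sketch for {-1, +1} weights in groups of 4, illustrating the flavor rather than either paper's actual design.

```python
import numpy as np

G = 4  # activations covered by one table lookup

def build_table(act_group):
    """Precompute the dot product of this activation group with every
    {-1, +1}^G weight pattern; the table is shared by all output neurons."""
    patterns = np.array([[1.0 if (p >> i) & 1 else -1.0 for i in range(G)]
                         for p in range(2 ** G)], dtype=np.float32)
    return patterns @ act_group                      # (2**G,)

def lut_matvec(weight_codes, activations):
    """Replace multiplies with lookups: y[j] = sum_k table[k][codes[j, k]].
    weight_codes: (out_features, num_groups) ints encoding weight patterns."""
    groups = activations.reshape(-1, G)              # (num_groups, G)
    tables = np.stack([build_table(g) for g in groups])
    return tables[np.arange(groups.shape[0]), weight_codes].sum(axis=1)

# Illustrative usage: 8 outputs over 16 activations -> 4 lookups per output.
acts = np.random.randn(16).astype(np.float32)
codes = np.random.randint(0, 2 ** G, size=(8, 16 // G))
y = lut_matvec(codes, acts)
```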

### Image Signal Processor

* IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline
  * Universitat Politècnica de Catalunya
  * Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time.
  * Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).

### Microservice

* Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility
  * University of Macau
  * Profile individual microservice latency in relation to environmental conditions.
  * Dynamically select the optimal set of microservices to scale (a toy greedy-scaling sketch follows this list).
  * An end-to-end latency predictor serves as a simulator to obtain real-time feedback.
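
The selection step can be sketched as a greedy loop around the latency predictor: trial-scale each candidate microservice in the predictor, then actually scale the one predicted to help end-to-end latency most. `predict_e2e_latency` below is a stand-in for Grad's learned predictor, not an API from the paper.

```python
def greedy_scale(replicas, predict_e2e_latency, slo_ms, budget):
    """Greedily add replicas until predicted end-to-end latency meets the
    SLO or the replica budget runs out.

    replicas: {service_name: replica_count}
    predict_e2e_latency: stand-in for Grad's learned latency predictor.
    """
    replicas = dict(replicas)
    for _ in range(budget):
        current = predict_e2e_latency(replicas)
        if current <= slo_ms:
            break
        # Predicted benefit of giving each service one more replica.
        def gain(svc):
            return current - predict_e2e_latency({**replicas, svc: replicas[svc] + 1})
        replicas[max(replicas, key=gain)] += 1
    return replicas
```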

## Acronyms

* ML: Machine Learning
* DKM: Differentiable KMeans Clustering
* CXL: Compute Express Link
* CSD: Computational Storage Drive
* LUT: Look-Up Table
* ISP: Image Signal Processor
* SLAM: Simultaneous Localization and Mapping

