HPCA 2025

Meta Info

Homepage: https://hpca-conf.org/2025/

Paper list: https://hpca-conf.org/2025/main-program/

Acceptance Rate

21% (= 112 / 534)

Papers

Large Language Models (LLMs)

  • LLM Compression

    • eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models [arXiv]

      • Apple

      • Compress LLMs to fit into storage-limited devices.

      • Propose a memory-efficient DKM (Differentiable KMeans Clustering) implementation.

      • Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.
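
      • Below, a minimal PyTorch sketch of the differentiable k-means (DKM) idea: weights are softly assigned to centroids through a softmax over distances, so gradients reach both weights and centroids during fine-tuning. The temperature, cluster count, and the dense (n, k) assignment map are illustrative; eDKM's contribution is making that map memory-efficient, which this sketch does not attempt.

```python
import torch

def dkm_cluster(weights, centroids, temperature=0.01):
    """Softly assign flattened weights (n,) to centroids (k,); return the
    clustered weights and the updated centroids."""
    # (n, k) distance map; keeping this map cheap is exactly what eDKM targets.
    dist = (weights.unsqueeze(1) - centroids.unsqueeze(0)).abs()
    attn = torch.softmax(-dist / temperature, dim=1)   # differentiable assignment
    clustered = attn @ centroids                       # soft-clustered weights
    # Centroid update: assignment-weighted average of the weights.
    new_centroids = (attn.t() @ weights) / attn.sum(dim=0).clamp_min(1e-8)
    return clustered, new_centroids

w = torch.randn(4096, requires_grad=True)              # a flattened weight tensor
c = torch.linspace(-1.0, 1.0, 16)                      # 16 clusters ~ 4-bit weights
w_clustered, c_new = dkm_cluster(w, c)
w_clustered.sum().backward()                           # gradients flow back to the weights
```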

  • LLM Quantization

    • BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [arXiv] [Code]

      • Cornell & MSR & ICL

      • Algorithm: Fine-grained data type adaptation that quantizes each group of (e.g., 128) weights with its own numerical data type.

      • Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.
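
      • A minimal sketch of group-wise data-type selection in this spirit: quantize each group with every candidate grid and keep the lowest-error one. The candidate 4-bit grids and the MSE criterion are illustrative assumptions, not BitMoD's exact data types.

```python
import torch

# Illustrative 4-bit candidate grids (not BitMoD's exact data types).
CANDIDATE_GRIDS = {
    "int4": torch.arange(-8, 8, dtype=torch.float32),
    "fp4":  torch.tensor([0., .5, 1., 1.5, 2., 3., 4., 6.,
                          -.5, -1., -1.5, -2., -3., -4., -6.]),
}

def quantize_group(group, grid):
    """Scale a weight group onto a grid and snap each weight to the nearest level."""
    scale = group.abs().max() / grid.abs().max() + 1e-12
    levels = grid * scale
    idx = (group.unsqueeze(1) - levels.unsqueeze(0)).abs().argmin(dim=1)
    return levels[idx]

def quantize_per_group(weights, group_size=128):
    """For each group, keep the candidate data type that minimizes quantization error."""
    weights = weights.flatten()
    out, chosen = torch.empty_like(weights), []
    for start in range(0, weights.numel(), group_size):
        g = weights[start:start + group_size]
        name, q = min(
            ((name, quantize_group(g, grid)) for name, grid in CANDIDATE_GRIDS.items()),
            key=lambda t: ((g - t[1]) ** 2).mean().item(),
        )
        out[start:start + group_size] = q
        chosen.append(name)                 # record the per-group data-type choice
    return out, chosen

w_q, types = quantize_per_group(torch.randn(1024))
```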

    • MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type [arXiv]

      • SJTU

      • Adaptively assign an appropriate data type to each group of weights (the group-wise selection idea is sketched under BitMoD above).

      • Propose an efficient real-time quantization mechanism.

      • Implement a specific processing element to efficiently support MANT and incorporate a real-time quantization unit.

    • Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format [arXiv]

      • NJU & MICAS KU Leuven

      • Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.

      • Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.
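
      • A minimal NumPy sketch of a group-shared-exponent encoding: one exponent per group, a few mantissa bits per element. The 3-bit mantissa and rounding policy are illustrative; Anda additionally decides the mantissa bit budget per group based on sensitivity.

```python
import numpy as np

def encode_group(x, mantissa_bits):
    """Quantize one activation group with a single shared exponent and
    `mantissa_bits` of per-element mantissa."""
    shared_exp = int(np.floor(np.log2(np.abs(x).max() + 1e-30)))
    step = 2.0 ** (shared_exp - mantissa_bits + 1)      # spacing of representable values
    mantissa = np.clip(np.round(x / step),
                       -(2 ** mantissa_bits), 2 ** mantissa_bits - 1).astype(np.int32)
    return shared_exp, mantissa, step

def decode_group(mantissa, step):
    return mantissa * step

acts = np.random.randn(128).astype(np.float32)          # one group of activations
exp, man, step = encode_group(acts, mantissa_bits=3)    # bit budget chosen per group in Anda
recon = decode_group(man, step)
print(float(np.abs(acts - recon).max()))                # per-group quantization error
```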

    • VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

      • SJTU

  • Energy-Efficient LLM Inference

    • DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency [arXiv]

      • UIUC & Microsoft Azure

      • Given the current load and available resources, select the energy-optimized configuration (e.g., different model parallelisms, GPU frequencies).
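
      • A toy sketch of the selection step, assuming an offline-profiled table mapping (parallelism, GPU frequency) to throughput and power; the numbers and SLO headroom below are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Config:
    tensor_parallel: int      # model-parallel degree
    gpu_freq_mhz: int         # GPU core frequency
    throughput_tok_s: float   # profiled generation throughput at this setting
    power_w: float            # profiled power at this setting

# Made-up profile entries standing in for an offline characterization.
PROFILED = [
    Config(8, 1980, 12000, 5200),
    Config(8, 1410,  9500, 3900),
    Config(4, 1980,  7000, 2700),
    Config(4, 1410,  5600, 2100),
]

def pick_config(load_tok_s, slo_headroom=1.1):
    """Return the lowest energy-per-token configuration that still covers the load."""
    feasible = [c for c in PROFILED
                if c.throughput_tok_s >= load_tok_s * slo_headroom]
    if not feasible:                      # overloaded: fall back to the fastest setting
        return max(PROFILED, key=lambda c: c.throughput_tok_s)
    return min(feasible, key=lambda c: c.power_w / c.throughput_tok_s)

print(pick_config(load_tok_s=5000))       # lower load -> a smaller, more efficient setting
```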

    • throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving

      • National Technical University of Athens

  • Long-Context LLM Inference

    • InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [arXiv]

      • PKU

      • Offload decoding-phase attention over the KV cache to Computational Storage Drives (CSDs), as sketched below.
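
      • A minimal sketch of the underlying split: attention is computed separately over two KV partitions (e.g., recent KV near the accelerator, older KV near storage) and the partial results are merged exactly with a log-sum-exp combine. The partitioning and shapes are illustrative assumptions, not InstAttention's interface.

```python
import torch

def partial_attention(q, k, v):
    """Return the un-normalized attention output plus its softmax statistics."""
    scores = (q @ k.t()) / (q.shape[-1] ** 0.5)          # (1, n)
    m = scores.max()
    p = torch.exp(scores - m)
    return p @ v, p.sum(), m                              # (1, d), scalar, scalar

def merge(out_a, sum_a, max_a, out_b, sum_b, max_b):
    """Exactly combine two partial attentions over disjoint KV ranges."""
    m = torch.maximum(max_a, max_b)
    wa, wb = torch.exp(max_a - m), torch.exp(max_b - m)
    return (out_a * wa + out_b * wb) / (sum_a * wa + sum_b * wb)

d = 128
q = torch.randn(1, d)
k_fast, v_fast = torch.randn(256, d), torch.randn(256, d)      # recent KV kept local
k_csd, v_csd = torch.randn(4096, d), torch.randn(4096, d)      # older KV offloaded
out = merge(*partial_attention(q, k_fast, v_fast),
            *partial_attention(q, k_csd, v_csd))
```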

  • Hardware-Assisted LLM Inference

    • LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding

      • ICT, CAS

    • PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM

      • Samsung SDS

    • FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference

      • Seoul National University

    • Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory

      • THU

    • Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM [arXiv]

      • ICT, CAS

      • The proposed system is named Hermes.

Diffusion Models

  • EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models [arXiv]

    • KAIST

    • Inter-iteration sparsity → an FFN-Reuse algorithm that identifies and skips redundant computations in FFN layers across different iterations.

    • Intra-iteration sparsity → a modified eager prediction method that accurately predicts attention scores and skips unnecessary computations within an iteration.

    • A dedicated hardware architecture to support the sparsity-inducing algorithms.
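
    • A minimal sketch of the inter-iteration reuse idea: recompute a token's FFN output only when its input has changed noticeably since the previous denoising iteration. The change threshold and per-token granularity are illustrative assumptions, not EXION's exact policy.

```python
import torch

class ReuseFFN(torch.nn.Module):
    """Wrap an FFN and reuse per-token outputs across denoising iterations."""
    def __init__(self, ffn, threshold=1e-2):
        super().__init__()
        self.ffn, self.threshold = ffn, threshold
        self.prev_in, self.prev_out = None, None

    @torch.no_grad()                                        # inference-time optimization
    def forward(self, x):                                   # x: (tokens, dim)
        if self.prev_in is None:
            out = self.ffn(x)                               # first iteration: compute everything
        else:
            changed = (x - self.prev_in).abs().amax(dim=1) > self.threshold
            out = self.prev_out.clone()
            if changed.any():
                out[changed] = self.ffn(x[changed])         # recompute only the changed tokens
        self.prev_in, self.prev_out = x.detach(), out.detach()
        return out

ffn = torch.nn.Sequential(torch.nn.Linear(512, 2048), torch.nn.GELU(),
                          torch.nn.Linear(2048, 512))
layer = ReuseFFN(ffn)
y0 = layer(torch.randn(64, 512))                            # full compute
y1 = layer(layer.prev_in + 1e-4 * torch.randn(64, 512))     # mostly reused
```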

  • Ditto: Accelerating Diffusion Model via Temporal Value Similarity [arXiv]

    • Yonsei University

    • Ditto: a difference processing algorithm

      • Leverage temporal similarity with quantization.

      • Perform full bit-width operations for the initial time step and process subsequent steps with temporal differences.

    • Ditto hardware → a specialized accelerator designed around the difference processing algorithm.
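
    • A minimal sketch of difference processing for a linear layer: because the layer is linear, the new output equals the previous output plus the layer applied to the (small, low-bit) input difference. The quantizer and bit width below are illustrative assumptions.

```python
import torch

def quantize(x, bits=4):
    """Symmetric uniform quantization (illustrative)."""
    scale = x.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    return torch.round(x / scale) * scale

class DeltaLinear(torch.nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = torch.nn.Linear(dim_in, dim_out, bias=False)
        self.prev_x, self.prev_y = None, None

    @torch.no_grad()
    def forward(self, x):
        if self.prev_x is None:
            y = self.linear(x)                      # initial time step: full bit-width
        else:
            # consecutive diffusion steps are similar, so the difference is
            # small/narrow enough to process at reduced precision
            delta = quantize(x - self.prev_x)
            y = self.prev_y + self.linear(delta)    # linearity: W(x_prev + d) = y_prev + W d
        self.prev_x, self.prev_y = x.detach(), y.detach()
        return y

layer = DeltaLinear(512, 512)
y_t0 = layer(torch.randn(16, 512))                  # full-precision first step
y_t1 = layer(layer.prev_x + 0.01 * torch.randn(16, 512))   # low-bit delta step
```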

Deep Learning Recommendation Models (DLRMs)

  • Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory

    • UC Merced & Meta

Dynamic Neural Networks

  • Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling

    • THU

ML Cluster Reliability

  • Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization [arXiv]

    • Alibaba

  • Revisiting Reliability in Large-Scale Machine Learning Research Clusters [arXiv]

    • Meta

ML for Systems

  • The Importance of Generalizability in Machine Learning for Systems

    • MIT & Google

ML Benchmark

  • MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI [arXiv]

Multi-GPU Systems

  • OASIS: Object-Aware Page Management for Multi-GPU Systems [Paper]

    • Pittsburgh & NVIDIA & Ghent

Collective Communication

  • TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology

    • KAIST

  • PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM

    • KAIST & Northeastern University & Boston University

Interconnect

  • Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck [Code]

    • HKUST-GZ & Intel & UCSD & HKUST

  • EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer

    • FDU

  • SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design [arXiv]

    • UIUC

    • Opportunistic context switches upon the detection of long access delays.

    • Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.

    • Adaptive page migration to promote hot pages in CXL-SSD to the host.

    • Implemented with a CXL-SSD simulator.
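
    • A toy sketch of the hot-page promotion idea in isolation; the access counting, thresholds, and interface are assumptions, not SkyByte's OS/hardware mechanism.

```python
from collections import Counter

class PagePromoter:
    """Promote frequently accessed CXL-SSD pages to host memory."""
    def __init__(self, hot_threshold=8, window=10_000):
        self.hot_threshold = hot_threshold
        self.window = window
        self.accesses = Counter()
        self.seen = 0
        self.promoted = set()

    def on_access(self, page):
        """Count accesses; promote a page once it proves hot within the window."""
        self.accesses[page] += 1
        self.seen += 1
        if page not in self.promoted and self.accesses[page] >= self.hot_threshold:
            self.promoted.add(page)        # stand-in for migrating the page to host DRAM
        if self.seen >= self.window:       # periodically reset counters to track recency
            self.accesses.clear()
            self.seen = 0

promoter = PagePromoter()
for addr in [0x10] * 9 + [0x20, 0x30]:
    promoter.on_access(addr)
print(promoter.promoted)                   # {16}: only the hot page is promoted
```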

Near-Memory Processing

  • AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing

    • SJTU

  • UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures [Code]

    • THU & HKUST & PKU

Bandwidth Partitioning

  • Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications

    • PKU

Deep Learning Accelerator

  • LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator [arXiv]

    • MSRA & NTU

    • Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization.
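
    • A minimal NumPy sketch of the LUT-based matrix-vector product such accelerators build on: activations are mapped to per-subvector centroids, partial dot products are precomputed, and inference reduces to table lookups plus additions. Sizes and the random codebook are illustrative; in LUT-DLA the codebooks come from multistage training.

```python
import numpy as np

D, OUT, SUB, K = 64, 32, 8, 16          # input dim, output dim, subvector length, #centroids
n_sub = D // SUB
rng = np.random.default_rng(0)

W = rng.standard_normal((OUT, D)).astype(np.float32)
codebook = rng.standard_normal((n_sub, K, SUB)).astype(np.float32)   # trained offline in practice

# Precompute the LUT: partial dot product of every output row with every centroid.
lut = np.einsum('osd,skd->osk', W.reshape(OUT, n_sub, SUB), codebook)  # (OUT, n_sub, K)

def lut_matvec(x):
    xs = x.reshape(n_sub, SUB)
    idx = ((xs[:, None, :] - codebook) ** 2).sum(-1).argmin(axis=1)    # nearest centroid per subvector
    # the multiply-accumulate becomes table lookups plus additions
    return lut[:, np.arange(n_sub), idx].sum(axis=1)                   # (OUT,)

x = rng.standard_normal(D).astype(np.float32)
approx = lut_matvec(x)
exact = W @ x          # approximation quality depends entirely on codebook training
```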

  • FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables

    • POSTECH & NAVER

Image Signal Processor

  • IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline

    • Universitat Politècnica de Catalunya

    • Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time.

    • Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).
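
    • A toy sketch of a mixed-resolution representation (a full-resolution region of interest plus a decimated background). The ROI choice and scale factor are illustrative, and this does not model the ISP hardware.

```python
import numpy as np

def mixed_resolution(frame, roi, scale=4):
    """frame: (H, W, 3) uint8; roi: (y0, y1, x0, x1) kept at full resolution."""
    y0, y1, x0, x1 = roi
    background = frame[::scale, ::scale]        # cheap decimation of the full frame
    roi_patch = frame[y0:y1, x0:x1]             # full-resolution crop of the ROI
    return background, roi_patch                # downstream vision task consumes both

frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
bg, roi = mixed_resolution(frame, roi=(400, 700, 800, 1200))
print(bg.shape, roi.shape)                      # (270, 480, 3) (300, 400, 3)
```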

Microservice

  • Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility

    • University of Macau

    • Profile individual microservice latency in relation to environmental conditions.

    • Dynamically select the optimal set of microservices for scaling.

    • An end-to-end latency predictor serves as a simulator to obtain real-time feedback.
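
    • A toy sketch of predictor-guided scaling: enumerate candidate replica counts, query a latency predictor, and keep the cheapest plan that meets the SLO. The service graph and queueing-style latency model below are made-up stand-ins for Grad's learned predictor.

```python
from itertools import product

SERVICES = ["frontend", "cart", "checkout"]
BASE_LATENCY_MS = {"frontend": 8.0, "cart": 5.0, "checkout": 12.0}

def predict_e2e_latency(replicas, load_rps):
    """Toy end-to-end latency predictor: each service's latency inflates as its
    per-replica load approaches an assumed capacity of 100 rps."""
    total = 0.0
    for s in SERVICES:
        util = min(load_rps / (replicas[s] * 100.0), 0.99)
        total += BASE_LATENCY_MS[s] / (1.0 - util)        # queueing-style inflation
    return total

def plan_scaling(load_rps, slo_ms, max_replicas=8):
    """Pick the cheapest replica assignment whose predicted latency meets the SLO."""
    best = None
    for combo in product(range(1, max_replicas + 1), repeat=len(SERVICES)):
        replicas = dict(zip(SERVICES, combo))
        if predict_e2e_latency(replicas, load_rps) <= slo_ms:
            cost = sum(combo)
            if best is None or cost < best[0]:
                best = (cost, replicas)
    return best[1] if best else None

print(plan_scaling(load_rps=450, slo_ms=80))
```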

Acronyms

  • ML: Machine Learning

  • DKM: Differentiable KMeans Clustering

  • CXL: Compute Express Link

  • CSD: Computational Storage Drive

  • LUT: Look-Up Table

  • ISP: Image Signal Processor

  • SLAM: Simultaneous Localization and Mapping
