HPCA 2025
Homepage:
Paper list:
Acceptance rate: 21% (= 112 / 534)
LLM Compression
eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models []
Apple
Compress LLMs to fit into storage-limited devices.
Propose a memory-efficient DKM (Differentiable KMeans Clustering) implementation.
Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.
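A minimal sketch of the differentiable k-means (DKM) step that eDKM makes memory-efficient, assuming soft assignments via a softmax over negative weight-centroid distances; eDKM's actual memory optimizations (cross-device tensor marshaling, uniquing, and sharding of the attention map) are not shown.

```python
# Hedged sketch of one DKM iteration: soft-assign weights to centroids,
# then update centroids with a differentiable weighted average so that
# gradients flow back into the weights during fine-tuning.
import torch

def dkm_step(weights: torch.Tensor, centroids: torch.Tensor, tau: float = 1e-2):
    # Pairwise distances between each weight (N,) and each centroid (K,).
    dist = (weights.unsqueeze(1) - centroids.unsqueeze(0)).abs()   # (N, K)
    attn = torch.softmax(-dist / tau, dim=1)                       # soft assignment (N, K)
    new_centroids = (attn * weights.unsqueeze(1)).sum(0) / attn.sum(0).clamp_min(1e-12)
    soft_weights = attn @ new_centroids                            # clustered view of weights
    return soft_weights, new_centroids

w = torch.randn(4096, requires_grad=True)               # one flattened weight tensor
c = torch.linspace(w.min().item(), w.max().item(), 16)  # 16 clusters -> 4-bit weights
w_q, c = dkm_step(w, c)
w_q.sum().backward()                                    # gradients reach the original weights
```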
LLM Quantization
BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [] []
Cornell & MSR & ICL
Algorithm: Fine-grained data type adaptation that quantizes each group of weights (e.g., 128 weights per group) with its own numerical data type.
Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.
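A hedged sketch of the per-group data type adaptation idea: quantize each group of 128 weights with every candidate low-bit data type (represented here as a small grid of representable values) and keep the type with the lowest reconstruction error. The candidate grids below are illustrative placeholders, not BitMoD's exact data types.

```python
import numpy as np

CANDIDATES = {
    "int3":    np.array([-4, -3, -2, -1, 0, 1, 2, 3], dtype=np.float32),
    "fp3-ish": np.array([-8, -4, -2, -1, 1, 2, 4, 8], dtype=np.float32),  # fp-like grid
}

def quantize_group(group: np.ndarray):
    best = None
    for name, grid in CANDIDATES.items():
        scale = (np.abs(group).max() + 1e-12) / np.abs(grid).max()  # per-group scale
        # Round each weight to the nearest representable value of this type.
        idx = np.abs(group[:, None] / scale - grid[None, :]).argmin(axis=1)
        deq = grid[idx] * scale
        err = np.mean((group - deq) ** 2)
        if best is None or err < best[0]:
            best = (err, name, deq)
    return best  # (mse, chosen data type, dequantized group)

weights = np.random.randn(128).astype(np.float32)
mse, dtype, deq = quantize_group(weights)
print(dtype, mse)
```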
MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type []
SJTU
Assign the appropriate data type for each group adaptively.
Propose an efficient real-time quantization mechanism.
Implement a dedicated processing element that efficiently supports MANT and incorporates a real-time quantization unit.
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format []
NJU & MICAS KU Leuven
Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.
Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.
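A minimal sketch of an Anda-style format, assuming a simplified block-floating-point stand-in: each group of activations shares the exponent of its largest value, while the mantissa width can differ per group (more bits for more sensitive modules). This is not the exact Anda bit layout.

```python
import numpy as np

def encode_group(x: np.ndarray, mantissa_bits: int):
    shared_exp = int(np.ceil(np.log2(np.abs(x).max() + 1e-38)))  # group-shared exponent
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    q = np.clip(np.round(x / scale),
                -(2 ** (mantissa_bits - 1)),
                2 ** (mantissa_bits - 1) - 1)                    # signed mantissas
    return q.astype(np.int32), shared_exp

def decode_group(q: np.ndarray, shared_exp: int, mantissa_bits: int):
    return q * 2.0 ** (shared_exp - (mantissa_bits - 1))

acts = np.random.randn(32).astype(np.float32)
q, e = encode_group(acts, mantissa_bits=4)   # a sensitive layer might get more bits
print(np.abs(acts - decode_group(q, e, 4)).max())
```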
VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
SJTU
Energy-Efficient LLM Inference
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency []
UIUC & Microsoft Azure
Given the current load and available resources, select the energy-optimized configuration (e.g., different model parallelisms, GPU frequencies).
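A hedged sketch of this configuration selection: among profiled (parallelism, GPU frequency) configurations that still meet the latency SLO at the current load, pick the one with the lowest predicted energy. The profile table here is a made-up placeholder, not DynamoLLM's measured data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    tensor_parallel: int   # number of GPUs sharding the model
    gpu_freq_mhz: int      # clock frequency cap

# Offline-profiled (latency_ms, energy_j) per config for one load bucket.
PROFILE = {
    Config(2, 1980): (950, 410.0),
    Config(2, 1410): (1300, 300.0),
    Config(4, 1410): (700, 520.0),
    Config(4, 990):  (1100, 380.0),
}

def pick_config(slo_ms: float) -> Config:
    feasible = [(e, cfg) for cfg, (lat, e) in PROFILE.items() if lat <= slo_ms]
    if not feasible:  # no config meets the SLO: fall back to the fastest one
        return max(PROFILE, key=lambda c: (c.tensor_parallel, c.gpu_freq_mhz))
    return min(feasible, key=lambda t: t[0])[1]  # lowest-energy feasible config

print(pick_config(slo_ms=1200))  # -> Config(tensor_parallel=4, gpu_freq_mhz=990)
```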
throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving
National Technical University of Athens
Long-Context LLM Inference
InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference []
PKU
Offload decode-phase attention over the KV cache to Computational Storage Drives (CSDs).
Hardware-Assisted LLM Inference
LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding
ICT, CAS
PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM
Samsung SDS
FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference
Seoul National University
Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory
THU
Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM []
ICT, CAS
Hermes
KAIST
Inter-iteration sparsity → an FFN-Reuse algorithm to identify and skip redundant computations in FFN layers across different iterations (see the sketch after this list)
Intra-iteration sparsity → a modified eager prediction method to accurately predict attention scores, skipping unnecessary computations within an iteration
A dedicated hardware architecture to support the sparsity-inducing algorithms.
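A toy sketch of the inter-iteration FFN-reuse idea from the first bullet, assuming a simple per-token input-delta threshold as the reuse criterion (the paper's actual redundancy-detection mechanism may differ):

```python
import torch

class ReuseFFN(torch.nn.Module):
    def __init__(self, dim: int, hidden: int, tol: float = 1e-3):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, hidden), torch.nn.GELU(), torch.nn.Linear(hidden, dim))
        self.tol = tol
        self.prev_in = None   # inputs from the previous iteration
        self.prev_out = None  # cached outputs from the previous iteration

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        if self.prev_in is None or self.prev_in.shape != x.shape:
            out = self.net(x)                              # first iteration: full compute
        else:
            changed = (x - self.prev_in).abs().amax(dim=1) > self.tol  # per-token delta
            out = self.prev_out.clone()
            if changed.any():
                out[changed] = self.net(x[changed])        # recompute only changed tokens
        self.prev_in, self.prev_out = x, out
        return out
```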
Yonsei University
Ditto: a difference processing algorithm
Leverage temporal value similarity across time steps, combined with quantization.
Perform full bit-width operations for the initial time step and process subsequent steps with temporal differences.
Design the Ditto hardware → a specialized accelerator that exploits the difference processing algorithm (see the sketch below)
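A hedged sketch of the difference-processing idea: run the first time step at full bit-width, then update cached outputs with the quantized delta between consecutive time steps. For a linear layer, W·x_t = W·x_{t-1} + W·(x_t − x_{t-1}), so the cached previous output plus a narrow, low-bit delta product reproduces the full result. The 8-bit delta quantization below is illustrative.

```python
import numpy as np

class DiffLinear:
    def __init__(self, w: np.ndarray):
        self.w, self.prev_x, self.prev_y = w, None, None

    def __call__(self, x: np.ndarray) -> np.ndarray:
        if self.prev_x is None:
            y = self.w @ x                       # initial step: full bit-width
        else:
            delta = x - self.prev_x              # temporally similar -> small range
            s = np.abs(delta).max() + 1e-12
            delta_q = np.round(delta * 127 / s) * s / 127   # 8-bit-style delta
            y = self.prev_y + self.w @ delta_q   # cheap update of the cached output
        self.prev_x, self.prev_y = x, y
        return y

layer = DiffLinear(np.random.randn(64, 64))
x0 = np.random.randn(64)
y0 = layer(x0)                                   # time step 0: exact
y1 = layer(x0 + 0.01 * np.random.randn(64))      # time step 1: approximated via delta
```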
Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory
UC Merced & Meta
Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling
THU
Alibaba
Meta
The Importance of Generalizability in Machine Learning for Systems
MIT & Google
Pittsburgh & NVIDIA & Ghent
TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology
KAIST
PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
KAIST & Northeastern University & Boston University
HKUST-GZ & Intel & UCSD & HKUST
EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer
FDU
UIUC
Opportunistic context switches upon the detection of long access delays.
Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.
Adaptive page migration to promote hot pages in CXL-SSD to the host.
Implemented with a CXL-SSD simulator.
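A toy sketch of the adaptive page-migration policy in the bullets above: count accesses per CXL-SSD page and promote pages whose access count crosses a threshold into host DRAM. The threshold and eviction rule are simplified placeholders, not the paper's tuned policy.

```python
from collections import Counter

PROMOTE_THRESHOLD = 8
HOST_CACHE_PAGES = 1024

access_counts: Counter[int] = Counter()
host_resident: set[int] = set()

def on_page_access(page: int) -> str:
    """Return where the access is served from, promoting hot pages."""
    if page in host_resident:
        return "host-dram"
    access_counts[page] += 1
    if access_counts[page] >= PROMOTE_THRESHOLD:
        if len(host_resident) >= HOST_CACHE_PAGES:           # evict the coldest page
            victim = min(host_resident, key=lambda p: access_counts[p])
            host_resident.discard(victim)
        host_resident.add(page)                               # promote hot page to host
        return "cxl-ssd (promoted)"
    return "cxl-ssd"
```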
AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing
SJTU
THU & HKUST & PKU
Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications
PKU
MSRA & NTU
Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization.
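A toy sketch of LUT-based inference in the spirit of this bullet: inputs are matched to a small codebook of centroids per subspace, and the dot products between every centroid and every weight column are precomputed into a look-up table, so inference becomes table lookups plus additions instead of multiplications. The codebook here is random; the multistage training that learns it is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D, OUT, SUB, K = 64, 32, 8, 16          # input dim, output dim, subvector len, codebook size

W = rng.standard_normal((D, OUT)).astype(np.float32)
codebook = rng.standard_normal((D // SUB, K, SUB)).astype(np.float32)  # per-subspace centroids

# Precompute LUT[s, k, o] = centroid_k(subspace s) . W[subspace s, o]
LUT = np.einsum("skc,sco->sko", codebook, W.reshape(D // SUB, SUB, OUT))

def lut_matvec(x: np.ndarray) -> np.ndarray:
    xs = x.reshape(D // SUB, SUB)
    # Nearest centroid per subvector = the low-bit "code" stored at runtime.
    codes = ((xs[:, None, :] - codebook) ** 2).sum(-1).argmin(1)       # (D//SUB,)
    return LUT[np.arange(D // SUB), codes].sum(0)                      # lookups + adds

x = rng.standard_normal(D).astype(np.float32)
print(np.abs(lut_matvec(x) - x @ W).mean())  # approximation error vs. exact matmul
```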
FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables
POSTECH & NAVER
IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline
Universitat Politècnica de Catalunya
Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time.
Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).
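A toy sketch of the mixed-resolution idea: keep regions of interest at full resolution and store everything else subsampled, shrinking the data handed to the vision pipeline. ROI selection here is a placeholder for the actual ISP/software cooperation.

```python
import numpy as np

def mixed_resolution(frame: np.ndarray, roi: tuple[slice, slice], factor: int = 4):
    """Return a compressed representation: low-res background + full-res ROI."""
    background = frame[::factor, ::factor].copy()   # subsampled background
    patch = frame[roi].copy()                       # full-resolution region of interest
    return background, patch

frame = np.random.randint(0, 256, (1080, 1920), dtype=np.uint8)
bg, patch = mixed_resolution(frame, (slice(400, 700), slice(800, 1200)))
print(bg.nbytes + patch.nbytes, "bytes vs", frame.nbytes)  # ~8x smaller here
```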
Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility
University of Macau
Profile individual microservice latency in relation to environmental conditions.
Dynamically select the optimal set of microservices for scaling.
An end-to-end latency predictor serves as a simulator to obtain real-time feedback.
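A hedged sketch of predictor-guided scaling in this spirit: use the end-to-end latency predictor as a simulator and greedily add a replica to whichever microservice yields the largest predicted improvement, until the SLO is met. `predict_latency` is a made-up stand-in for the learned predictor, not Grad's model.

```python
def predict_latency(replicas: dict[str, int]) -> float:
    # Placeholder predictor: latency falls as bottleneck services gain replicas.
    base = {"gateway": 40.0, "search": 120.0, "rank": 90.0}
    return sum(ms / replicas[svc] for svc, ms in base.items())

def plan_scaling(replicas: dict[str, int], slo_ms: float, budget: int) -> dict[str, int]:
    plan = dict(replicas)
    for _ in range(budget):
        if predict_latency(plan) <= slo_ms:
            break
        # Try one extra replica on each service; keep the best "what-if" outcome.
        best = min(plan, key=lambda s: predict_latency({**plan, s: plan[s] + 1}))
        plan[best] += 1
    return plan

print(plan_scaling({"gateway": 1, "search": 1, "rank": 1}, slo_ms=120.0, budget=5))
```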
ML: Machine Learning
DKM: Differentiable KMeans Clustering
CXL: Compute Express Link
CSD: Computational Storage Drive
LUT: Look-Up Table
ISP: Image Signal Processor
SLAM: Simultaneous Localization and Mapping
EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models []
Ditto: Accelerating Diffusion Model via Temporal Value Similarity []
Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization []
Revisiting Reliability in Large-Scale Machine Learning Research Clusters []
MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI []
OASIS: Object-Aware Page Management for Multi-GPU Systems []
Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck []
SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design []
UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures []
LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator []