HPCA 2025

Meta Info

Homepage: https://hpca-conf.org/2025/

Paper list: https://hpca-conf.org/2025/main-program/

Acceptance Rate

21% (= 112 / 534)

Papers

Large Language Models (LLMs)

  • LLM Compression

    • eDKM: An Efficient and Accurate Train-Time Weight Clustering for Large Language Models [arXiv]

      • Apple

      • Compress LLMs to fit into storage-limited devices.

      • Propose a memory-efficient DKM (Differentiable KMeans Clustering) implementation.

      • Fine-tune and compress a pretrained LLaMA 7B model from 12.6 GB to 2.5 GB.

  • LLM Quantization

    • BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration [arXiv] [Code]

      • Cornell & MSR & ICL

      • Algorithm: fine-grained data type adaptation that uses a different numerical data type to quantize each group of (e.g., 128) weights (a toy sketch of this group-wise idea appears after this list).

      • Hardware: Employ a bit-serial processing element to support multiple numerical precisions and data types.

    • MANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type [arXiv]

      • SJTU

      • Assign the appropriate data type for each group adaptively.

      • Propose an efficient real-time quantization mechanism.

      • Implement a specific processing element to efficiently support MANT and incorporate a real-time quantization unit.

    • Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format [arXiv]

      • NJU & MICAS KU Leuven

      • Investigate the sensitivity of activation precision across various LLM modules and its impact on overall model accuracy.

      • Anda data type: an adaptive data format with group-shared exponent bits and dynamic mantissa bit allocation.

    • VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference

      • SJTU

  • Energy-Efficient LLM Inference

    • DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency [arXiv]

      • UIUC & Microsoft Azure

      • Given the current load and available resources, select the energy-optimized configuration (e.g., different model parallelisms, GPU frequencies).

    • throttLL’eM: Predictive GPU Throttling for Energy Efficient LLM Inference Serving

      • National Technical University of Athens

  • Long-Context LLM Inference

    • InstAttention: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference [arXiv]

      • PKU

      • Offload decoding-phase attention over the KV cache to Computational Storage Drives (CSDs).

  • Hardware-Assisted LLM Inference

    • LAD: Efficient Accelerator for Generative Inference of LLM with Locality Aware Decoding

      • ICT, CAS

    • PAISE: PIM-Accelerated Inference Scheduling Engine for Transformer-based LLM

      • Samsung SDS

    • FACIL: Flexible DRAM Address Mapping for SoC-PIM Cooperative On-device LLM Inference

      • Seoul National University

    • Lincoln: Real-Time 50~100B LLM Inference on Consumer Devices with LPDDR-Interfaced, Compute-Enabled Flash Memory

      • THU

    • Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM [arXiv]

      • ICT, CAS

      • The proposed system is named Hermes.
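
The group-adaptive quantization idea shared by BitMoD, MANT, and Anda can be illustrated with a minimal NumPy sketch: split a weight tensor into groups of 128 and let each group pick whichever candidate low-bit grid reconstructs it best. The two toy grids, the error metric, and all names below are illustrative assumptions, not the papers' actual data types.

```python
# Illustrative sketch (not the papers' actual datatypes): group-adaptive
# low-bit weight quantization, where each group of 128 weights picks the
# candidate quantization grid that minimizes its own reconstruction error.
import numpy as np

GROUP = 128

# Two toy 4-bit "data types": a uniform INT4 grid and a non-uniform FP4-like grid.
INT4_GRID = np.arange(-8, 8, dtype=np.float32)                      # 16 uniform levels
FP4_GRID  = np.array([-12, -8, -6, -4, -3, -2, -1, -0.5,
                       0, 0.5, 1, 2, 3, 4, 6, 8], dtype=np.float32)  # 16 non-uniform levels

def quantize_group(w, grid):
    """Scale the group to the grid's range, snap to the nearest level, rescale back."""
    scale = np.max(np.abs(w)) / np.max(np.abs(grid)) + 1e-12
    idx = np.abs(w[:, None] / scale - grid[None, :]).argmin(axis=1)
    return grid[idx] * scale

def group_adaptive_quantize(weights):
    """Quantize a 1-D weight vector group by group, choosing the better grid per group."""
    w = weights.reshape(-1, GROUP)
    out = np.empty_like(w)
    chosen = []
    for g in range(w.shape[0]):
        candidates = {name: quantize_group(w[g], grid)
                      for name, grid in [("int4", INT4_GRID), ("fp4", FP4_GRID)]}
        best = min(candidates, key=lambda k: np.sum((candidates[k] - w[g]) ** 2))
        out[g] = candidates[best]
        chosen.append(best)
    return out.reshape(weights.shape), chosen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(4 * GROUP).astype(np.float32)
    wq, types = group_adaptive_quantize(w)
    print("per-group types:", types, "MSE:", float(np.mean((w - wq) ** 2)))
```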

Diffusion Models

  • EXION: Exploiting Inter- and Intra-Iteration Output Sparsity for Diffusion Models [arXiv]

    • KAIST

    • Inter-iteration sparsity → an FFN-Reuse algorithm to identify and skip redundant computations in FFN layers across different iterations.

    • Intra-iteration sparsity → a modified eager prediction method to accurately predict attention scores and skip unnecessary computations within an iteration.

    • A dedicated hardware architecture to support the sparsity-inducing algorithms.

  • Ditto: Accelerating Diffusion Model via Temporal Value Similarity [arXiv]

    • Yonsei University

    • Ditto: a difference processing algorithm (a toy sketch appears after this list)

      • Leverage temporal similarity with quantization.

      • Perform full bit-width operations for the initial time step and process subsequent steps with temporal differences.

    • Design the Ditto hardware → a specialized hardware accelerator for the difference processing algorithm.
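
A minimal sketch of the difference-processing idea described under Ditto: keep the first denoising step at full bit width and represent each later step as a low-bit quantized delta against the previous step. The 8-bit delta width and the synthetic activations are assumptions for illustration only.

```python
# Illustrative sketch of difference processing across diffusion time steps:
# keep the first step at full precision and store later steps as low-bit
# deltas against the previous step (the bit width below is an assumption).
import numpy as np

def quantize_delta(delta, bits=8):
    """Symmetric uniform quantization of a delta tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(delta)) / qmax + 1e-12
    q = np.clip(np.round(delta / scale), -qmax, qmax)
    return q.astype(np.int8), scale

def process_timesteps(activations):
    """activations: list of per-step tensors; returns the reconstructed tensors."""
    recon = [activations[0].copy()]           # step 0: full bit-width
    for t in range(1, len(activations)):
        q, scale = quantize_delta(activations[t] - recon[-1])
        recon.append(recon[-1] + q.astype(np.float32) * scale)  # cheap low-bit update
    return recon

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.standard_normal((4, 4)).astype(np.float32)
    # Consecutive diffusion steps tend to be similar, so deltas stay small.
    steps = [base + 0.05 * t * rng.standard_normal((4, 4)).astype(np.float32) for t in range(5)]
    recon = process_timesteps(steps)
    print("max abs error at last step:", float(np.max(np.abs(recon[-1] - steps[-1]))))
```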

Deep Learning Recommendation Models (DLRMs)

  • Machine Learning-Guided Memory Optimization for DLRM Inference on Tiered Memory

    • UC Merced & Meta

Dynamic Neural Networks

  • Adyna: Accelerating Dynamic Neural Networks with Adaptive Scheduling

    • THU

ML Cluster Reliability

  • Enhancing Large-Scale AI Training Efficiency: The C4 Solution for Real-Time Anomaly Detection and Communication Optimization [arXiv]

    • Alibaba

  • Revisiting Reliability in Large-Scale Machine Learning Research Clusters [arXiv]

    • Meta

ML for Systems

  • The Importance of Generalizability in Machine Learning for Systems

    • MIT & Google

ML Benchmark

  • MLPerf Power: Benchmarking the Energy Efficiency of Machine Learning Systems from μWatts to MWatts for Sustainable AI [arXiv]

Multi-GPU Systems

  • OASIS: Object-Aware Page Management for Multi-GPU Systems [Paper]

    • Pittsburgh & NVIDIA & Ghent

Collective Communication

  • TidalMesh: Topology-Driven AllReduce Collective Communication for Mesh Topology

    • KAIST

  • PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM

    • KAIST & Northeastern University & Boston University

Interconnect

  • Push Multicast: A Speculative and Coherent Interconnect for Mitigating Manycore CPU Communication Bottleneck [Code]

    • HKUST-GZ & Intel & UCSD & HKUST

  • EIGEN: Enabling Efficient 3DIC Interconnect with Heterogeneous Dual-Layer Network-on-Active-Interposer

    • FDU

Compute Express Link (CXL)

  • SkyByte: Architecting An Efficient Memory-Semantic CXL-based SSD with OS and Hardware Co-design [arXiv]

    • UIUC

    • Opportunistic context switches upon the detection of long access delays.

    • Data coalescing upon log cleaning to reduce the I/O traffic to flash chips.

    • Adaptive page migration to promote hot pages in the CXL-SSD to the host (a toy sketch of such a promotion policy appears after this list).

    • Implemented with a CXL-SSD simulator.
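
A toy sketch of an adaptive hot-page promotion policy of the kind the SkyByte notes mention: count accesses to pages resident on the CXL-SSD and migrate a page to host DRAM once it looks hot. The threshold, decay, and class names are invented for illustration and are not the paper's mechanism.

```python
# Illustrative sketch (not SkyByte's actual policy): count accesses to pages
# resident in the CXL-SSD and promote a page to host DRAM once its access
# count crosses a threshold; counters decay periodically to track hotness.
from collections import defaultdict

PROMOTE_THRESHOLD = 8      # assumed threshold: accesses before promotion
DECAY_FACTOR = 0.5         # assumed periodic decay to age out stale hotness

class HotPagePromoter:
    def __init__(self):
        self.access_count = defaultdict(int)
        self.in_host_dram = set()

    def on_access(self, page):
        """Called on every device-side page access; returns True if the page is promoted."""
        if page in self.in_host_dram:
            return False
        self.access_count[page] += 1
        if self.access_count[page] >= PROMOTE_THRESHOLD:
            self.in_host_dram.add(page)          # migrate the page to host memory
            del self.access_count[page]
            return True
        return False

    def decay(self):
        """Periodically halve counters so only recently hot pages get promoted."""
        for page in list(self.access_count):
            self.access_count[page] = int(self.access_count[page] * DECAY_FACTOR)

if __name__ == "__main__":
    promoter = HotPagePromoter()
    trace = [0x10, 0x10, 0x20, 0x10, 0x10, 0x10, 0x10, 0x10, 0x10, 0x20]
    promoted = [p for p in trace if promoter.on_access(p)]
    print("promoted pages:", [hex(p) for p in promoted])
```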

Near-Memory Processing

  • AsyncDIMM: Achieving Asynchronous Execution in DIMM-Based Near-Memory Processing

    • SJTU

  • UniNDP: A Unified Compilation and Simulation Tool for Near DRAM Processing Architectures [Code]

    • THU & HKUST & PKU

Bandwidth Partitioning

  • Criticality-Aware Instruction-Centric Bandwidth Partitioning for Data Center Applications

    • PKU

Deep Learning Accelerator

  • LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator [arXiv]

    • MSRA & NTU

    • Transform various DNN models into LUTs (Look-Up Tables) via multistage training to achieve extreme low-bit quantization (a toy sketch of LUT-based dot products appears after this list).

  • FIGLUT: An Energy-Efficient Accelerator Design for FP-INT GEMM Using Look-Up Tables

    • POSTECH & NAVER
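
A product-quantization-style sketch of how a look-up table can stand in for the multiplications of a matrix-vector product, in the spirit of LUT-DLA and FIGLUT. The random codebook, sub-vector length, and function names are assumptions; the papers' trained codebooks and hardware mapping differ.

```python
# Illustrative product-quantization-style LUT sketch (an assumption about the
# general technique, not necessarily LUT-DLA's exact scheme): weights are stored
# as codebook indices, and a per-input lookup table of centroid dot products
# replaces the multiplications in a matrix-vector product.
import numpy as np

SUBVEC = 4     # length of each weight sub-vector
K = 16         # codebook size (4-bit codes)

def build_codebook(rng):
    """A random codebook stands in for one learned via (multistage) training."""
    return rng.standard_normal((K, SUBVEC)).astype(np.float32)

def encode_weights(W, codebook):
    """Replace each SUBVEC-long slice of every weight row with its nearest code index."""
    rows, cols = W.shape
    W = W.reshape(rows, cols // SUBVEC, SUBVEC)
    dists = ((W[:, :, None, :] - codebook[None, None, :, :]) ** 2).sum(-1)
    return dists.argmin(-1).astype(np.uint8)          # shape: (rows, cols // SUBVEC)

def lut_matvec(codes, codebook, x):
    """y ≈ W @ x, computed with table lookups instead of per-weight multiplies."""
    x_sub = x.reshape(-1, SUBVEC)                      # (cols // SUBVEC, SUBVEC)
    lut = x_sub @ codebook.T                           # (cols // SUBVEC, K): all centroid dot products
    return lut[np.arange(x_sub.shape[0])[None, :], codes].sum(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    W = rng.standard_normal((8, 32)).astype(np.float32)
    x = rng.standard_normal(32).astype(np.float32)
    codebook = build_codebook(rng)
    codes = encode_weights(W, codebook)
    print("exact:", W @ x)
    print("LUT approx:", lut_matvec(codes, codebook, x))
```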

Image Signal Processor

  • IRIS: Unleashing ISP-Software Cooperation to Optimize the Machine Vision Pipeline

    • Universitat Politècnica de Catalunya

    • Enable the Image Signal Processor (ISP) to create compressed, mixed-resolution images in real time (a toy sketch follows this list).

    • Used for image classification & monocular Simultaneous Localization and Mapping (SLAM).
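
A rough sketch of what a mixed-resolution output like the one IRIS produces could look like: keep a region of interest at full resolution and average-pool the background. The ROI selection and pooling factor are placeholders, not IRIS's actual ISP pipeline.

```python
# Illustrative sketch (not IRIS's actual pipeline): build a mixed-resolution
# image by keeping a region of interest at full resolution and storing the
# background at a coarser resolution to shrink the data handed to vision models.
import numpy as np

def mixed_resolution(image, roi, factor=4):
    """image: (H, W) array; roi: (y0, y1, x0, x1); returns (full-res ROI, downsampled background)."""
    y0, y1, x0, x1 = roi
    roi_crop = image[y0:y1, x0:x1].copy()
    h, w = image.shape
    # Coarse background: average-pool by `factor` (assumes H and W divide evenly).
    background = image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return roi_crop, background

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    frame = rng.integers(0, 256, size=(64, 64)).astype(np.float32)
    roi_full, bg_coarse = mixed_resolution(frame, roi=(16, 48, 16, 48))
    original = frame.nbytes
    compressed = roi_full.nbytes + bg_coarse.nbytes
    print(f"bytes: {original} -> {compressed} ({compressed / original:.0%})")
```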

Microservice

  • Grad: Intelligent Microservice Scaling by Harnessing Resource Fungibility

    • University of Macau

    • Profile individual microservice latency in relation to environmental conditions.

    • Dynamically select the optimal set of microservices for scaling.

    • An end-to-end latency predictor serves as a simulator to obtain real-time feedback (a toy sketch of this loop follows).
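
A toy sketch of the scaling loop the Grad notes describe: an end-to-end latency predictor acts as a simulator, and replicas are added greedily to whichever microservice yields the largest predicted latency reduction. The predictor formula and service names are made up for illustration.

```python
# Illustrative sketch (made-up predictor and services, not Grad's actual model):
# use an end-to-end latency predictor as a simulator and greedily add replicas
# to whichever microservice yields the largest predicted latency reduction.
def predicted_latency(replicas, base_ms, load_rps):
    """Toy end-to-end predictor: per-service latency grows with load per replica."""
    return sum(base_ms[s] * (1.0 + load_rps[s] / (100.0 * replicas[s])) for s in replicas)

def plan_scaling(replicas, base_ms, load_rps, budget):
    """Greedily spend `budget` extra replicas where they help end-to-end latency most."""
    replicas = dict(replicas)
    for _ in range(budget):
        current = predicted_latency(replicas, base_ms, load_rps)
        best_service, best_gain = None, 0.0
        for s in replicas:
            trial = {**replicas, s: replicas[s] + 1}
            gain = current - predicted_latency(trial, base_ms, load_rps)
            if gain > best_gain:
                best_service, best_gain = s, gain
        if best_service is None:
            break
        replicas[best_service] += 1
    return replicas

if __name__ == "__main__":
    base_ms = {"frontend": 5.0, "search": 20.0, "checkout": 10.0}
    load_rps = {"frontend": 300.0, "search": 500.0, "checkout": 100.0}
    replicas = {"frontend": 2, "search": 2, "checkout": 2}
    print("plan:", plan_scaling(replicas, base_ms, load_rps, budget=3))
```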

Acronyms

  • ML: Machine Learning

  • DKM: Differentiable KMeans Clustering

  • CXL: Compute Express Link

  • CSD: Computational Storage Drive

  • LUT: Look-Up Table

  • ISP: Image Signal Processor

  • SLAM: Simultaneous Localization and Mapping
