ASPLOS 2025

Meta Info

Homepage: https://www.asplos-conference.org/asplos2025/

Proceedings

Acceptance Rate

  • Overall: 17.5% (= 160 / 912)

  • Fall: 14.1% (= 46 / 326)

    • Major Revision: 20 (invited) -> pending

  • Summer: 12.7% (= 65 / 510)

    • Major Revision: 42 (invited) -> 40 (accepted)

  • Spring: 2.6% (= 2 / 76)

    • Major Revision: 7 (invited) -> 7 (accepted)
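
The per-cycle accepts (46 + 65 + 2 = 113) do not by themselves reach the overall figure; a plausible reading, inferred from the numbers above rather than stated by the conference, is that the overall total also counts papers accepted after major revision (Fall's 20 invited revisions are still pending and thus excluded):

  • Direct accepts: 46 (Fall) + 65 (Summer) + 2 (Spring) = 113

  • Accepted via Major Revision: 40 (Summer) + 7 (Spring) = 47

  • Total: 113 + 47 = 160 out of 326 + 510 + 76 = 912 submissions (17.5%)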

Papers

Large Language Models (LLMs)

  • LLM Training

    • FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism

      • PKU & ByteDance

    • Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling

      • PKU & Alibaba

    • Vela: A Virtualized LLM Training System with GPU Direct RoCE

      • IBM Research

    • Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning

      • NUS & GaTech & Alibaba & GMU & SYSU

  • LLM Inference

    • Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow [arXiv] [Code]

      • CMU

    • Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management

      • Korea University

    • COMET: Towards Practical W4A4KV4 LLMs Serving

      • ICT, CAS

    • Past-Future Scheduler for LLM Serving under SLA Guarantees

      • Beihang University & SenseTime

    • POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

      • UW & MSR India

    • Medusa: Accelerating Serverless LLM Inference with Materialization

      • THU

    • vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [arXiv]

      • MSR India & IISc

    • TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms

      • UIUC & Microsoft Azure Research

    • PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System

      • ICT, CAS & ETH & UofT & NVIDIA

    • PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference

      • UMich & ETH & Google

    • Fast On-device LLM Inference with NPUs

      • PKU & BUPT

  • LLM-based Applications

    • Towards End-to-End Optimization of LLM-based Applications with Ayo

      • CUHK

  • MoE Training

    • FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models

      • HKUST-GZ & HKUST & HIT-SZ

    • MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training

      • HKUST-GZ

  • MoE Inference

    • MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs

      • UC Berkeley

    • Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline

      • SYSU & HKUST & Huawei & Peng Cheng Laboratory

  • Retrieval-Augmented Generation (RAG)

    • Accelerating Retrieval-Augmented Generation

      • Cornell & Kansas & UMass Amherst & Samsung Electronics

  • Security

    • PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption

      • SJTU IPADS

  • Coarse-Grained Reconfigurable Array (CGRA)

    • PICACHU: Plug-In CGRA Handling Upcoming Nonlinear Operations in LLMs

      • NYU

Deep Learning Recommendation Models (DLRMs)

  • DLRM Training

    • Frugal: Efficient and Economic Embedding Model Training with Commodity GPUs

      • THU

  • DLRM Inference

    • Load and MLP-Aware Thread Orchestration for Recommendation Systems Inference on CPUs

      • Penn State & AMD

Resource Management

  • Shared ML Clusters

    • Design and Operation of Shared Machine Learning Clusters on Campus

      • HKUST

  • Resource Oversubscription

    • Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms

      • Microsoft

  • Serverless Computing

    • Litmus: Fair Pricing for Serverless Computing

      • Binghamton & Intel Labs

    • Concurrency-Informed Orchestration for Serverless Functions

      • UVA & Alibaba & Amazon

    • Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity

      • ICT, CAS

  • Graceful Degradation

    • Cooperative Graceful Degradation in Containerized Clouds [arXiv]

      • UC Irvine

  • Microservice

    • Embracing Imbalance: Dynamic Load Shifting among Microservice Containers in Shared Clusters

      • University of Macau

Deep Learning Compilation

  • Mosaic: Exploiting Instruction-Level Parallelism on Deep Learning Accelerators with iTex Tessellation

    • ICT, CAS & Tencent

  • Einsum Trees: An Abstraction for Optimizing the Execution of Tensor Expressions

    • Friedrich Schiller University Jena

  • Relax: Composable Abstractions for End-to-End Dynamic Machine Learning

    • CMU & NVIDIA

  • Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning

    • USTC & NIO

  • Optimizing Deep Learning Inference Efficiency through Block Dependency Analysis

    • ICT, CAS

  • Composing Distributed Computations Through Task and Kernel Fusion [arXiv]

    • Stanford & NVIDIA

Parallelism

  • PartIR: Composing SPMD Partitioning Strategies for Machine Learning [arXiv]

    • Google DeepMind

  • GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism [arXiv]

    • NVIDIA & CMU & UC Berkeley & MIT

GPU Sharing

  • Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads [arXiv]

    • Stanford & UofT

Performance Prediction

  • Forecasting GPU Performance for Deep Learning Training and Inference

    • GaTech & Meta

Checkpointing

  • PCcheck: Persistent Concurrent Checkpointing for ML

    • ETH

Memory Disaggregation

  • pulse: Accelerating Distributed Pointer-Traversals on Disaggregated Memory

    • Yale

  • EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation

    • Purdue

Acceleration

  • DynaX: Sparse Attention Acceleration with Dynamic X:M Fine-Grained Structured Pruning

    • CQU

  • GUST: Graph Edge-Coloring Utilization for Accelerating Sparse Matrix Vector Multiplication

    • UMD

  • RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling

    • GaTech

  • Squeezing Operator Performance Potential for the Ascend Architecture

    • NJU & Huawei

Performance Tuning

  • DarwinGame: Playing Tournaments for Tuning Applications in Noisy Cloud Environments [Paper]

    • Utah & MIT & NEU

DPU Offloading

  • OS2G: A High-Performance DPU Offloading Architecture for GPU-based Deep Learning with Object Storage

    • Alibaba & ZJU

Tracing

  • Automatic Tracing in Task-Based Runtime Systems

    • Stanford & NVIDIA
