ASPLOS 2025
Meta Info
Homepage: https://www.asplos-conference.org/asplos2025/
Acceptance Rate
Summer: 12.7% (65 / 510)
Papers
Large Language Models (LLMs)
LLM Training
FlexSP: Accelerating Large Language Model Training via Flexible Sequence Parallelism
PKU & ByteDance
Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling
PKU & Alibaba
Vela: A Virtualized LLM Training System with GPU Direct RoCE
IBM Research
Concerto: Automatic Communication Optimization and Scheduling for Large-Scale Deep Learning
NUS & GaTech & Alibaba & GMU & SYSU
LLM Inference
Accelerating LLM Serving for Multi-turn Dialogues with Efficient Resource Management
Korea University
COMET: Towards Practical W4A4KV4 LLMs Serving
ICT, CAS
Past-Future Scheduler for LLM Serving under SLA Guarantees
Beihang University & SenseTime
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
UW & MSR India
Medusa: Accelerating Serverless LLM Inference with Materialization
THU
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [arXiv]
MSR India & IISc
TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms
UIUC & Microsoft Azure Research
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
ICT, CAS & ETH & UofT & NVIDIA
PIM is All You Need: A CXL-Enabled GPU-Free System for LLM Inference
UMich & ETH & Google
Fast On-device LLM Inference with NPUs
PKU & BUPT
LLM-based Applications
Towards End-to-End Optimization of LLM-based Applications with Ayo
CUHK
MoE Training
FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models
HKUST-GZ & HKUST & HIT-SZ
MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
HKUST-GZ
MoE Inference
MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs
UC Berkeley
Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline
SYSU & HKUST & Huawei & Peng Cheng Laboratory
Retrieval-Augmented Generation (RAG)
Accelerating Retrieval-Augmented Generation
Cornell & Kansas & UMass Amherst & Samsung Electronics
Security
PipeLLM: Fast and Confidential Large Language Model Services with Speculative Pipelined Encryption
SJTU IPADS
Coarse-Grained Reconfigurable Array (CGRA)
PICACHU: Plug-In CGRA Handling Upcoming Nonlinear Operations in LLMs
NYU
Deep Learning Recommendation Models (DLRMs)
DLRM Training
Frugal: Efficient and Economic Embedding Model Training with Commodity GPUs
THU
DLRM Inference
Load and MLP-Aware Thread Orchestration for Recommendation Systems Inference on CPUs
Penn State & AMD
Resource Management
Shared ML Clusters
Design and Operation of Shared Machine Learning Clusters on Campus
HKUST
Resource Oversubscription
Coach: Exploiting Temporal Patterns for All-Resource Oversubscription in Cloud Platforms
Microsoft
Serverless Computing
Litmus: Fair Pricing for Serverless Computing
Binghamton & Intel Labs
Concurrency-Informed Orchestration for Serverless Functions
UVA & Alibaba & Amazon
Dilu: Enabling GPU Resourcing-on-Demand for Serverless DL Serving via Introspective Elasticity
ICT, CAS
Graceful Degradation
Cooperative Graceful Degradation in Containerized Clouds [arXiv]
UC Irvine
Microservice
Embracing Imbalance: Dynamic Load Shifting among Microservice Containers in Shared Clusters
University of Macau
Deep Learning Compilation
Mosaic: Exploiting Instruction-Level Parallelism on Deep Learning Accelerators with iTex Tessellation
ICT, CAS & Tencent
Einsum Trees: An Abstraction for Optimizing the Execution of Tensor Expressions
Friedrich Schiller University Jena
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
CMU & NVIDIA
Pruner: A Draft-then-Verify Exploration Mechanism to Accelerate Tensor Program Tuning
USTC & NIO
Optimizing Deep Learning Inference Efficiency through Block Dependency Analysis
ICT, CAS
Composing Distributed Computations Through Task and Kernel Fusion [arXiv]
Stanford & NVIDIA
Parallelism
PartIR: Composing SPMD Partitioning Strategies for Machine Learning [arXiv]
Google DeepMind
GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism [arXiv]
NVIDIA & CMU & UC Berkeley & MIT
GPU Sharing
Tally: Non-Intrusive Performance Isolation for Concurrent Deep Learning Workloads [arXiv]
Stanford & UofT
Performance Prediction
Forecasting GPU Performance for Deep Learning Training and Inference
GaTech & Meta
Checkpointing
PCcheck: Persistent Concurrent Checkpointing for ML
ETH
Memory Disaggregation
pulse: Accelerating Distributed Pointer-Traversals on Disaggregated Memory
Yale
EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation
Purdue
Acceleration
DynaX: Sparse Attention Acceleration with Dynamic X:M Fine-Grained Structured Pruning
CQU
GUST: Graph Edge-Coloring Utilization for Accelerating Sparse Matrix Vector Multiplication
UMD
RASSM: Residue-based Acceleration of Single Sparse Matrix Computation via Adaptive Tiling
GaTech
Squeezing Operator Performance Potential for the Ascend Architecture
NJU & Huawei
Performance Tuning
DarwinGame: Playing Tournaments for Tuning Applications in Noisy Cloud Environments [Paper]
Utah & MIT & NEU
DPU Offloading
OS2G: A High-Performance DPU Offloading Architecture for GPU-based Deep Learning with Object Storage
Alibaba & ZJU
Tracing
Automatic Tracing in Task-Based Runtime Systems
Stanford & NVIDIA