Meta Info
Homepage: https://mlsys.org/Conferences/2025
Paper list: https://mlsys.org/virtual/2025/papers.html?filter=titles
Acceptance Rate
22.5% (= 61 / 271)
Papers
Large Language Models (LLMs)
LLM Training
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training [Paper]
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training [Paper] [arXiv]
Scaling Deep Learning Training with MPMD Pipeline Parallelism [Paper] [arXiv]
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer [Paper] [arXiv]
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training [Paper]
Photon: Federated LLM Pre-Training [Paper] [arXiv]
Balancing Pipeline Parallelism with Vocabulary Parallelism [Paper] [arXiv] [Code]
Youmu: Efficient Columnar Data Pipeline for LLM Training [Paper]
LLM Inference
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [Paper] [arXiv] [Homepage] [Code]
CMU & NVIDIA & SJTU & UC Berkeley
Seesaw: High-throughput LLM Inference via Model Re-sharding [Paper] [arXiv]
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Paper] [arXiv]
FlexInfer: Flexible LLM Inference with CPU Computations [Paper]
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling [Paper]
Marconi: Prefix Caching for the Era of Hybrid LLMs [Paper] [arXiv]
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving [Paper]
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments [Paper] [arXiv]
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking [Paper] [arXiv]
Context Parallelism for Scalable Million-Token Inference [Paper] [arXiv]
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [Paper] [arXiv]
Yale & IIT Roorkee & IBM Research
Attention Mechanisms
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Paper] [arXiv] [Homepage] [Code]
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference [Paper]
FlexAttention: A Programming Model for Generating Fused Attention Variants [Paper] [arXiv]
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [Paper] [arXiv]
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs [Paper] [arXiv]
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [Paper] [arXiv]
PKU & CUHK & Zhipu AI & THU & Shanghai AI Lab
RLHF Training
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation [Paper] [arXiv] [Code]
MoE Inference
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Paper] [arXiv]
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators [Paper] [Code (incoming)]
LoRA Fine-tuning
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression [Paper]
LLM Distillation
Self-Data Distillation for Recovering Quality in Pruned Large Language Models [Paper] [arXiv]
LLM Agent Simulation
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution [Paper] [arXiv]
LLM for Relational Data Analytics
Optimizing LLM Queries in Relational Data Analytics Workloads [Paper] [arXiv]
Diffusion Models
Video Generation
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation [Paper]
Image Generation
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling [Paper] [arXiv]
UMass Amherst & Adobe Research
Constructs model cascades → easy queries are served by more lightweight diffusion models, while harder queries escalate to larger ones
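The cascade idea can be sketched as follows. This is an illustrative toy, not DiffServe's actual implementation: `light_model`, `heavy_model`, and the confidence heuristic are all hypothetical stand-ins for a real query-difficulty estimator and real diffusion backends.

```python
# Illustrative two-stage model cascade (assumed design, not DiffServe's code):
# a cheap "light" model answers easy queries; a confidence check escalates
# hard queries to the expensive "heavy" model.

def light_model(query: str):
    """Hypothetical lightweight diffusion model: fast, lower quality.
    Returns a (result, confidence) pair; here confidence is a toy
    heuristic based on prompt length."""
    confidence = 0.9 if len(query) < 20 else 0.4
    return f"light({query})", confidence

def heavy_model(query: str):
    """Hypothetical full-size diffusion model: slow, higher quality."""
    return f"heavy({query})"

def serve(query: str, threshold: float = 0.8):
    """Route a query through the cascade: try the light model first,
    and escalate to the heavy model when confidence falls below
    the threshold."""
    result, confidence = light_model(query)
    if confidence >= threshold:
        return result
    return heavy_model(query)
```

A real serving system would replace the length heuristic with a learned difficulty or quality predictor, and would batch and schedule queries per model tier.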
Resource Management
Scheduling
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions [Paper] [arXiv]
Morphling: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling [Paper] [arXiv]
Virtual CPU Oversubscription
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud [Paper]
AIOps
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds [Paper] [arXiv]
Deep Learning Compilation
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives [Paper]
Super-Resolution
VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution [Paper] [arXiv]
PDF Parsing
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine [Paper] [Code]
Acronyms
RLHF: Reinforcement Learning from Human Feedback
LoRA: Low-Rank Adaptation