MLSys 2025
Homepage:
Paper list:
Acceptance rate: 22.5% (61 / 271)
LLM Training
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training []
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training [] []
CMU & AWS
Scaling Deep Learning Training with MPMD Pipeline Parallelism [] []
NVIDIA
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer [] []
OSU & Microsoft
APOLLO: SGD-like Memory, AdamW-level Performance [] [] [] []
UT-Austin & Meta AI
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training []
Photon: Federated LLM Pre-Training [] []
UCambridge
Balancing Pipeline Parallelism with Vocabulary Parallelism [] [] []
Sea AI Lab
Youmu: Efficient Columnar Data Pipeline for LLM Training []
LLM Inference
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [] [] [] []
CMU & NVIDIA & SJTU & UC Berkeley
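XGrammar is an engine for grammar-constrained decoding: at each step, tokens that would take the output outside the target grammar are masked out of the logits. Below is a generic, minimal sketch of that idea, not XGrammar's implementation; the toy vocabulary, the hand-rolled `allowed_next` automaton, and the random stand-in logits are all illustrative assumptions.

```python
# Toy grammar-constrained decoding: at each step, mask logits so only
# tokens that keep the partial output inside the grammar stay eligible.
# Everything here (vocab, grammar, scorer) is a stand-in, not XGrammar's API.
import math
import random

VOCAB = ["{", "}", '"key"', ":", '"val"', ","]

def allowed_next(prefix: list[str]) -> set[str]:
    """Tiny hand-rolled automaton for the JSON-ish grammar {"key":"val"}."""
    if not prefix:            return {"{"}
    if prefix[-1] == "{":     return {'"key"'}
    if prefix[-1] == '"key"': return {":"}
    if prefix[-1] == ":":     return {'"val"'}
    if prefix[-1] == '"val"': return {"}"}
    return set()              # after "}" generation stops

def decode() -> str:
    prefix: list[str] = []
    while True:
        legal = allowed_next(prefix)
        if not legal:
            return "".join(prefix)
        # Stand-in for model logits; a real engine gets these from the LLM.
        logits = {t: random.gauss(0.0, 1.0) for t in VOCAB}
        # Grammar mask: illegal tokens get -inf, so they can never be picked.
        masked = {t: (v if t in legal else -math.inf) for t, v in logits.items()}
        prefix.append(max(masked, key=masked.get))  # greedy pick

print(decode())  # always grammar-valid, e.g. {"key":"val"}
```

A real engine pushes this mask computation into the tokenizer/vocabulary level and caches automaton states so the per-step overhead stays small.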
Seesaw: High-throughput LLM Inference via Model Re-sharding [] []
UofT
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [] []
Harvard & UC Berkeley
FlexInfer: Flexible LLM Inference with CPU Computations []
GaTech
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling []
THU
Marconi: Prefix Caching for the Era of Hybrid LLMs [] []
Princeton & AWS
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving []
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [] [] [] []
MIT
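W4A8KV4 in the title denotes the serving precision: 4-bit weights, 8-bit activations, and a 4-bit KV cache. A minimal sketch of symmetric fake-quantization at those bit-widths follows, purely to illustrate the notation; it is not QServe's kernels or algorithm.

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantize-dequantize to `bits` signed integers."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(x).max() / qmax or 1.0   # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w, a, kv = rng.normal(size=(4, 4)), rng.normal(size=4), rng.normal(size=(2, 4))
# W4A8KV4: weights at 4 bits, activations at 8 bits, KV cache at 4 bits.
w4, a8, kv4 = fake_quant(w, 4), fake_quant(a, 8), fake_quant(kv, 4)
print(np.abs(w - w4).max(), np.abs(a - a8).max())  # 4-bit error >> 8-bit error
```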
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments [] []
UCambridge & PKU & ETH
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking [] []
Qualcomm AI Research
Context Parallelism for Scalable Million-Token Inference [] []
Meta
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [] []
Yale & IIT Roorkee & IBM Research
Attention Mechanisms
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [] [] [] []
UW & NVIDIA
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [] [] [] []
MIT & NVIDIA
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference []
UCSD & AWS
FlexAttention: A Programming Model for Generating Fused Attention Variants [] []
Meta
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [] []
Microsoft
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs [] []
Microsoft & GaTech
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [] []
PKU & CUHK & Zhipu AI & THU & Shanghai AI Lab
RLHF Training
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation [] [] []
THU
MoE Inference
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [] []
ByteDance Seed & SJTU
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators [] [ (incoming)]
UIUC
LoRA Fine-tuning
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression []
LLM Distillation
Self-Data Distillation for Recovering Quality in Pruned Large Language Models [] []
Cerebras Systems
LLM Agent Simulation
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution [] []
Stanford & GaTech
LLM for Relational Data Analytics
Optimizing LLM Queries in Relational Data Analytics Workloads [] []
UC Berkeley
Video Generation
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation []
Image Generation
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling [] []
UMass Amherst & Adobe Research
Constructs model cascades → easy queries can be processed by more lightweight diffusion models (see the sketch below)
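A minimal sketch of such a query-aware cascade: try the cheap model first, escalate only when a quality check fails. The tier names, costs, and the `passes_quality` discriminator are illustrative assumptions, not DiffServe's actual design.

```python
# Toy diffusion-model cascade: route a query to the cheapest model that is
# expected to meet quality; escalate to the big model only when it fails.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost: float                      # relative GPU cost per image
    generate: Callable[[str], str]   # prompt -> image (stubbed as a string)

def make_stub(name: str) -> Callable[[str], str]:
    return lambda prompt: f"<image from {name} for {prompt!r}>"

CASCADE = [
    Tier("distilled-2step", cost=1.0, generate=make_stub("distilled-2step")),
    Tier("base-25step",     cost=6.0, generate=make_stub("base-25step")),
]

def passes_quality(image: str, prompt: str) -> bool:
    # Stand-in discriminator: real systems use a learned quality estimator.
    return len(prompt.split()) <= 8   # pretend short prompts are "easy"

def serve(prompt: str) -> str:
    for tier in CASCADE[:-1]:
        image = tier.generate(prompt)
        if passes_quality(image, prompt):
            return image                  # easy query: cheap model suffices
    return CASCADE[-1].generate(prompt)   # hard query: fall back to big model

print(serve("a cat"))   # short prompt → served by the lightweight model
print(serve("an intricate baroque palace interior with many chandeliers"))
```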
Scheduling
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions [] []
Morphling: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling [] []
ECNU & Alibaba & HUST
Virtual CPU Oversubscription
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud []
Microsoft
AIOps
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds [] []
Microsoft
Compute-Communication Overlapping
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives []
Volumetric Streaming
VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution [] []
UW-Madison & USC & MSRA
Data Processing
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine [] []
Acronyms
RLHF: Reinforcement Learning from Human Feedback
MoE: Mixture-of-Experts
LoRA: Low-Rank Adaptation
LUT: Lookup Table