# MLSys 2025

## Meta Info

Homepage: <https://mlsys.org/Conferences/2025>

Paper list: <https://mlsys.org/virtual/2025/papers.html?filter=titles>

### Acceptance Rate

22.5% (= 61 / 271)

## Papers

### Large Language Models (LLMs)

* LLM Training
  * Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training \[[Paper](https://mlsys.org/virtual/2025/poster/3240)]
    * Cornell & Meta & MIT
  * PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training \[[Paper](https://mlsys.org/virtual/2025/poster/3283)] \[[arXiv](https://arxiv.org/abs/2410.07192)]
    * CMU & AWS
  * Scaling Deep Learning Training with MPMD Pipeline Parallelism \[[Paper](https://mlsys.org/virtual/2025/poster/3257)] \[[arXiv](https://arxiv.org/abs/2412.14374)]
    * NVIDIA
  * Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer \[[Paper](https://mlsys.org/virtual/2025/poster/3268)] \[[arXiv](https://arxiv.org/abs/2408.16978)]
    * OSU & Microsoft
  * APOLLO: SGD-like Memory, AdamW-level Performance \[[Paper](https://mlsys.org/virtual/2025/poster/3241)] \[[Homepage](https://zhuhanqing.github.io/APOLLO/)] \[[arXiv](https://arxiv.org/abs/2412.05270)] \[[Code](https://github.com/zhuhanqing/APOLLO)]
    * UT-Austin & Meta AI
    * **Outstanding Paper Honorable Mention**
  * Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training \[[Paper](https://mlsys.org/virtual/2025/poster/3256)]
    * Rutgers
  * Photon: Federated LLM Pre-Training \[[Paper](https://mlsys.org/virtual/2025/poster/3280)] \[[arXiv](https://arxiv.org/abs/2411.02908)]
    * UCambridge
  * Balancing Pipeline Parallelism with Vocabulary Parallelism \[[Paper](https://mlsys.org/virtual/2025/poster/3237)] \[[arXiv](https://arxiv.org/abs/2411.05288)] \[[Code](https://github.com/sail-sg/VocabularyParallelism)]
    * Sea AI Lab
  * Youmu: Efficient Columnar Data Pipeline for LLM Training \[[Paper](https://mlsys.org/virtual/2025/poster/3272)] \[[Slides](https://mlsys.org/media/mlsys-2025/Slides/3272.pdf)]
    * UVA & UofT & CUHK
* LLM Inference
  * XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models \[[Paper](https://mlsys.org/virtual/2025/poster/3235)] \[[arXiv](https://arxiv.org/abs/2411.15100)] \[[Homepage](https://xgrammar.mlc.ai)] \[[Code](https://github.com/mlc-ai/xgrammar)]
    * CMU & NVIDIA & SJTU & UC Berkeley
  * Seesaw: High-throughput LLM Inference via Model Re-sharding \[[Paper](https://mlsys.org/virtual/2025/poster/3253)] \[[arXiv](https://arxiv.org/abs/2503.06433)]
    * UofT
    * **Outstanding Paper Honorable Mention**
  * NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference \[[Paper](https://mlsys.org/virtual/2025/poster/3230)] \[[arXiv](https://arxiv.org/abs/2411.01142)]
    * Harvard & UC Berkeley
  * FlexInfer: Flexible LLM Inference with CPU Computations \[[Paper](https://mlsys.org/virtual/2025/poster/3234)]
    * GaTech
  * SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling \[[Paper](https://mlsys.org/virtual/2025/poster/3231)]
    * THU
  * Marconi: Prefix Caching for the Era of Hybrid LLMs \[[Paper](https://mlsys.org/virtual/2025/poster/3260)] \[[arXiv](https://arxiv.org/abs/2411.19379)]
    * Princeton & AWS
    * **Outstanding Paper Honorable Mention**
  * Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving \[[Paper](https://mlsys.org/virtual/2025/poster/3236)]
  * QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving \[[Paper](https://mlsys.org/virtual/2025/poster/3288)] \[[arXiv](https://arxiv.org/abs/2405.04532)] \[[Homepage](https://hanlab.mit.edu/projects/qserve)] \[[Code](https://github.com/mit-han-lab/omniserve)]
    * MIT
  * ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments \[[Paper](https://mlsys.org/virtual/2025/poster/3284)] \[[arXiv](https://arxiv.org/abs/2502.09334)]
    * UCambridge & PKU & ETH
  * Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking \[[Paper](https://mlsys.org/virtual/2025/poster/3251)] \[[arXiv](https://arxiv.org/abs/2412.01380)]
    * Qualcomm AI Research
  * Context Parallelism for Scalable Million-Token Inference \[[Paper](https://mlsys.org/virtual/2025/poster/3255)] \[[arXiv](https://arxiv.org/abs/2411.01783)]
    * Meta
  * MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs \[[Paper](https://mlsys.org/virtual/2025/poster/3244)] \[[arXiv](https://arxiv.org/abs/2503.11663)]
    * Yale & IIT Roorkee & IBM Research
* Attention Mechanisms
  * FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving \[[Paper](https://mlsys.org/virtual/2025/poster/3259)] \[[arXiv](https://arxiv.org/abs/2501.01005)] \[[Homepage](https://flashinfer.ai)] \[[Code](https://github.com/flashinfer-ai/flashinfer)]
    * UW & NVIDIA
    * **Outstanding Paper Award**
  * LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention \[[Paper](https://mlsys.org/virtual/2025/poster/3270)] \[[arXiv](https://arxiv.org/abs/2502.14866)] \[[Homepage](https://hanlab.mit.edu/projects/lserve)] \[[Code](https://github.com/mit-han-lab/omniserve)]
    * MIT & NVIDIA
  * FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference \[[Paper](https://mlsys.org/virtual/2025/poster/3278)]
    * UCSD & AWS
  * FlexAttention: A Programming Model for Generating Fused Attention Variants \[[Paper](https://mlsys.org/virtual/2025/poster/3286)] \[[arXiv](https://arxiv.org/abs/2412.05496)]
    * Meta
  * LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers \[[Paper](https://mlsys.org/virtual/2025/poster/3269)] \[[arXiv](https://arxiv.org/abs/2405.10480)]
    * Microsoft
  * TurboAttention: Efficient Attention Approximation for High Throughputs LLMs \[[Paper](https://mlsys.org/virtual/2025/poster/3250)] \[[arXiv](https://arxiv.org/abs/2412.08585)]
    * Microsoft & GaTech
  * SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention \[[Paper](https://mlsys.org/virtual/2025/poster/3258)] \[[arXiv](https://arxiv.org/abs/2406.15486)]
    * PKU & CUHK & Zhipu AI & THU & Shanghai AI Lab
* RLHF Training
  * ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation \[[Paper](https://mlsys.org/virtual/2025/poster/3228)] \[[arXiv](https://arxiv.org/abs/2406.14088)] \[[Code](https://github.com/openpsi-project/ReaLHF)]
    * THU
* MoE Inference
  * COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts \[[Paper](https://mlsys.org/virtual/2025/poster/3246)] \[[arXiv](https://arxiv.org/abs/2502.19811)]
    * ByteDance Seed & SJTU
    * **Outstanding Paper Honorable Mention**
  * MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators \[[Paper](https://mlsys.org/virtual/2025/poster/3266)] \[[Code](https://github.com/Supercomputing-System-AI-Lab/MiLo)]
    * UIUC
* LoRA Fine-tuning
  * HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression \[[Paper](https://mlsys.org/virtual/2025/poster/3254)] \[[Slides](https://mlsys.org/media/mlsys-2025/Slides/3254_Yq1cyy5.pdf)]
    * THU
* LLM Distillation
  * Self-Data Distillation for Recovering Quality in Pruned Large Language Models \[[Paper](https://mlsys.org/virtual/2025/poster/3247)] \[[arXiv](https://arxiv.org/abs/2410.09982)]
    * Cerebras Systems
* LLM Agent Simulation
  * AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution \[[Paper](https://mlsys.org/virtual/2025/poster/3243)] \[[arXiv](https://arxiv.org/abs/2411.03519)]
    * Stanford & GaTech
* LLM for Relational Data Analytics
  * Optimizing LLM Queries in Relational Data Analytics Workloads \[[Paper](https://mlsys.org/virtual/2025/poster/3261)] \[[arXiv](https://arxiv.org/abs/2403.05821)]
    * UC Berkeley

### Diffusion Models

* Video Generation
  * ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation \[[Paper](https://mlsys.org/virtual/2025/poster/3252)]
    * UofT & AWS
* Image Generation
  * DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling \[[Paper](https://mlsys.org/virtual/2025/poster/3287)] \[[arXiv](https://arxiv.org/abs/2411.15381)]
    * UMass Amherst & Adobe Research
    * Constructs model cascades so that easy queries are served by lighter-weight diffusion models (see the sketch after this list)
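
The cascade idea behind DiffServe can be illustrated with a minimal sketch: a cheap difficulty estimator routes easy prompts to a lightweight diffusion model and lets harder prompts fall through to a larger one. This is not the paper's actual implementation; all names, the threshold, and the word-count-based difficulty heuristic below are hypothetical placeholders.

```python
# Hedged sketch of query-aware model cascading (not DiffServe's real API).
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class CascadeStage:
    name: str
    generate: Callable[[str], Any]  # prompt -> image
    cost: float                     # relative latency/GPU cost (illustrative only)


def serve_query(prompt: str,
                stages: list[CascadeStage],
                difficulty: Callable[[str], float],
                threshold: float = 0.5) -> Any:
    """Route a text-to-image query through a model cascade.

    Prompts judged "easy" (difficulty below the threshold) are handled by
    the cheapest stage; everything else escalates to the most capable stage.
    """
    if difficulty(prompt) < threshold:
        return stages[0].generate(prompt)   # lightweight model
    return stages[-1].generate(prompt)      # full-size model


# Hypothetical usage with placeholder models and a toy difficulty heuristic.
light = CascadeStage("light-model", generate=lambda p: f"img[{p}|light]", cost=1.0)
heavy = CascadeStage("heavy-model", generate=lambda p: f"img[{p}|heavy]", cost=4.0)
image = serve_query("a red bicycle", [light, heavy],
                    difficulty=lambda p: min(len(p.split()) / 20, 1.0))
```

In a real serving system the difficulty signal and the escalation policy would be learned and query-aware rather than a fixed word-count rule; the sketch only shows the routing structure.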

### Resource Management

* Scheduling
  * LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions \[[Paper](https://mlsys.org/virtual/2025/poster/3281)] \[[arXiv](https://arxiv.org/abs/2412.09840)]
    * Google
    * **Outstanding Paper Honorable Mention**
  * Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling \[[Paper](https://mlsys.org/virtual/2025/poster/3232)] \[[arXiv](https://arxiv.org/abs/2408.08586)]
    * ECNU & Alibaba & HUST
* Virtual CPU Oversubscription
  * ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud \[[Paper](https://mlsys.org/virtual/2025/poster/3277)] \[[Slides](https://mlsys.org/media/mlsys-2025/Slides/3277_0U2rb6k.pdf)]
    * Microsoft
* AIOps
  * AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds \[[Paper](https://mlsys.org/virtual/2025/poster/3285)] \[[arXiv](https://arxiv.org/abs/2501.06706)]
    * Microsoft

### Deep Learning Compilation

* TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives \[[Paper](https://mlsys.org/virtual/2025/poster/3248)]
  * ByteDance Seed

### Super-Resolution

* VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution \[[Paper](https://mlsys.org/virtual/2025/poster/3275)] \[[arXiv](https://arxiv.org/abs/2502.12151)]
  * UW-Madison & USC & MSRA

### PDF Parsing

* AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine \[[Paper](https://mlsys.org/virtual/2025/poster/3229)] \[[Code](https://github.com/7shoe/AdaParse)] \[[Slides](https://mlsys.org/media/mlsys-2025/Slides/3229.pdf)]
  * UChicago & Argonne National Laboratory

## Acronyms

* RLHF: Reinforcement Learning from Human Feedback
* MoE: Mixture-of-Experts
* LoRA: Low-Rank Adaptation
* LUT: Lookup Table

