MLSys 2025

Meta Info

Homepage: https://mlsys.org/Conferences/2025

Paper list: https://mlsys.org/virtual/2025/papers.html?filter=titles

Acceptance Rate

22.5% (= 61 / 271)

Papers

Large Language Models (LLMs)

  • LLM Training

    • Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training [Paper]

      • Cornell & Meta & MIT

    • PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training [Paper] [arXiv]

      • CMU & AWS

    • Scaling Deep Learning Training with MPMD Pipeline Parallelism [Paper] [arXiv]

      • NVIDIA

    • Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer [Paper] [arXiv]

      • OSU & Microsoft

    • APOLLO: SGD-like Memory, AdamW-level Performance [Paper] [Homepage] [arXiv] [Code]

      • UT-Austin & Meta AI

      • Outstanding Paper Honorable Mention

    • Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training [Paper]

      • Rutgers

    • Photon: Federated LLM Pre-Training [Paper] [arXiv]

      • UCambridge

    • Balancing Pipeline Parallelism with Vocabulary Parallelism [Paper] [arXiv] [Code]

      • Sea AI Lab

    • Youmu: Efficient Columnar Data Pipeline for LLM Training [Paper] [Slides]

      • UVA & UofT & CUHK

  • LLM Inference

    • XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [Paper] [arXiv] [Homepage] [Code]

      • CMU & NVIDIA & SJTU & UC Berkeley

    • Seesaw: High-throughput LLM Inference via Model Re-sharding [Paper] [arXiv]

      • UofT

      • Outstanding Paper Honorable Mention

    • NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [Paper] [arXiv]

      • Harvard & UC Berkeley

    • FlexInfer: Flexible LLM Inference with CPU Computations [Paper]

      • GaTech

    • SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling [Paper]

      • THU

    • Marconi: Prefix Caching for the Era of Hybrid LLMs [Paper] [arXiv]

      • Princeton & AWS

      • Outstanding Paper Honorable Mention

    • Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving [Paper]

    • QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [Paper] [arXiv] [Homepage] [Code]

      • MIT

    • ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments [Paper] [arXiv]

      • UCambridge & PKU & ETH

    • Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking [Paper] [arXiv]

      • Qualcomm AI Research

    • Context Parallelism for Scalable Million-Token Inference [Paper] [arXiv]

      • Meta

    • MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [Paper] [arXiv]

      • Yale & IIT Roorkee & IBM Research

  • Attention Mechanisms

    • FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [Paper] [arXiv] [Homepage] [Code]

      • UW & NVIDIA

      • Outstanding Paper Award

    • LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [Paper] [arXiv] [Homepage] [Code]

      • MIT & NVIDIA

    • FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference [Paper]

      • UCSD & AWS

    • FlexAttention: A Programming Model for Generating Fused Attention Variants [Paper] [arXiv]

      • Meta

    • LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [Paper] [arXiv]

      • Microsoft

    • TurboAttention: Efficient Attention Approximation for High Throughputs LLMs [Paper] [arXiv]

      • Microsoft & GaTech

    • SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [Paper] [arXiv]

      • PKU & CUHK & Zhipu AI & THU & Shanghai AI Lab

  • RLHF Training

    • ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation [Paper] [arXiv] [Code]

      • THU

  • MoE Inference

    • COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [Paper] [arXiv]

      • ByteDance Seed & SJTU

      • Outstanding Paper Honorable Mention

    • MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators [Paper] [Code]

      • UIUC

  • LoRA Fine-tuning

    • HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression [Paper] [Slides]

      • THU

  • LLM Distillation

    • Self-Data Distillation for Recovering Quality in Pruned Large Language Models [Paper] [arXiv]

      • Cerebras Systems

  • LLM Agent Simulation

    • AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution [Paper] [arXiv]

      • Stanford & GaTech

  • LLM for Relational Data Analytics

    • Optimizing LLM Queries in Relational Data Analytics Workloads [Paper] [arXiv]

      • UC Berkeley

Diffusion Models

  • Video Generation

    • ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation [Paper]

      • UofT & AWS

  • Image Generation

    • DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling [Paper] [arXiv]

      • UMass Amherst & Adobe Research

      • Constructs model cascades so that easy queries can be served by more lightweight diffusion models (see the sketch below)
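
A minimal sketch of the cascade idea behind DiffServe, under stated assumptions: a router serves each query with a lightweight diffusion model unless an estimated difficulty score crosses a threshold, in which case it escalates to a heavier model. The stage names, the `difficulty_score` heuristic, and the threshold here are illustrative assumptions, not DiffServe's actual components (the paper's query-aware scaling is more involved).

```python
# Hypothetical two-stage text-to-image cascade in the spirit of DiffServe:
# easy prompts go to a lightweight diffusion model; only queries judged
# "hard" are escalated to a heavier, more expensive model.
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CascadeStage:
    name: str
    generate: Callable[[str], str]  # prompt -> image (placeholder: a string)
    cost: float                     # relative serving cost


def difficulty_score(prompt: str) -> float:
    """Toy difficulty estimate: longer, more detailed prompts count as harder.

    A real system would use a learned quality/difficulty predictor (e.g. on
    the lightweight model's output) rather than this heuristic.
    """
    return min(1.0, len(prompt.split()) / 50.0)


def serve(prompt: str, light: CascadeStage, heavy: CascadeStage,
          threshold: float = 0.5) -> Tuple[str, str]:
    """Route a query through the cascade; return (stage name, image)."""
    if difficulty_score(prompt) <= threshold:
        return light.name, light.generate(prompt)
    return heavy.name, heavy.generate(prompt)


if __name__ == "__main__":
    light = CascadeStage("light-diffusion",
                         lambda p: f"<image from small model: {p}>", cost=1.0)
    heavy = CascadeStage("heavy-diffusion",
                         lambda p: f"<image from large model: {p}>", cost=4.0)

    prompts = [
        "a red apple",
        "a highly detailed oil painting of a bustling 19th-century harbor "
        "at dusk with dozens of ships, lanterns, and reflections on the water",
    ]
    for prompt in prompts:
        stage, image = serve(prompt, light, heavy)
        print(f"{stage}: {image}")
```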

Resource Management

  • Scheduling

    • LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions [Paper] [arXiv]

      • Google

      • Outstanding Paper Honorable Mention

    • Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling [Paper] [arXiv]

      • ECNU & Alibaba & HUST

  • Virtual CPU Oversubscription

    • ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud [Paper] [Slides]

      • Microsoft

  • AIOps

    • AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds [Paper] [arXiv]

      • Microsoft

Deep Learning Compilation

  • TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives [Paper]

    • ByteDance Seed

Super-Resolution

  • VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution [Paper] [arXiv]

    • UW-Madison & USC & MSRA

PDF Parsing

  • AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine [Paper] [Code] [Slides]

    • UChicago & Argonne National Laboratory

Acronyms

  • RLHF: Reinforcement Learning from Human Feedback

  • MoE: Mixture-of-Experts

  • LoRA: Low-Rank Adaptation

  • LUT: Lookup Table
