MLSys 2025
Homepage:
Paper list:
Acceptance rate: 22.5% (61 / 271)
LLM Training
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training []
PipeFill: Using GPUs During Bubbles in Pipeline-parallel LLM Training [] []
CMU & AWS
Scaling Deep Learning Training with MPMD Pipeline Parallelism [] []
NVIDIA
Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer [] []
OSU & Microsoft
APOLLO: SGD-like Memory, AdamW-level Performance [] [] [] []
UT-Austin & Meta AI
Radius: Range-based Gradient Sparsity for Large Foundation Model Pre-training []
Photon: Federated LLM Pre-Training [] []
UCambridge
Balancing Pipeline Parallelism with Vocabulary Parallelism [] [] []
Sea AI Lab
Youmu: Efficient Columnar Data Pipeline for LLM Training []
LLM Inference
XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models [] [] [] []
CMU & NVIDIA & SJTU & UC Berkeley
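XGrammar is an engine for grammar-constrained decoding: at each step, tokens that would take the output outside the target grammar are masked out of the logits. Below is a generic, minimal sketch of that idea, not XGrammar's implementation; the toy vocabulary, the hand-rolled `allowed_next` automaton, and the random stand-in logits are all illustrative assumptions.

```python
# Toy grammar-constrained decoding: at each step, mask logits so only
# tokens that keep the partial output inside the grammar stay eligible.
# Everything here (vocab, grammar, scorer) is a stand-in, not XGrammar's API.
import math
import random

VOCAB = ["{", "}", '"key"', ":", '"val"', ","]

def allowed_next(prefix: list[str]) -> set[str]:
    """Tiny hand-rolled automaton for the JSON-ish grammar {"key":"val"}."""
    if not prefix:            return {"{"}
    if prefix[-1] == "{":     return {'"key"'}
    if prefix[-1] == '"key"': return {":"}
    if prefix[-1] == ":":     return {'"val"'}
    if prefix[-1] == '"val"': return {"}"}
    return set()              # after "}" generation stops

def decode() -> str:
    prefix: list[str] = []
    while True:
        legal = allowed_next(prefix)
        if not legal:
            return "".join(prefix)
        # Stand-in for model logits; a real engine gets these from the LLM.
        logits = {t: random.gauss(0.0, 1.0) for t in VOCAB}
        # Grammar mask: illegal tokens get -inf, so they can never be picked.
        masked = {t: (v if t in legal else -math.inf) for t, v in logits.items()}
        prefix.append(max(masked, key=masked.get))  # greedy pick

print(decode())  # always grammar-valid, e.g. {"key":"val"}
```

A real engine pushes this mask computation into the tokenizer/vocabulary level and caches automaton states so the per-step overhead stays small.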
Seesaw: High-throughput LLM Inference via Model Re-sharding [] []
UofT
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference [] []
Harvard & UC Berkeley
FlexInfer: Flexible LLM Inference with CPU Computations []
GaTech
SOLA: Optimizing SLO Attainment for Large Language Model Serving with State-Aware Scheduling []
THU
Marconi: Prefix Caching for the Era of Hybrid LLMs [] []
Princeton & AWS
Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving []
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving [] [] [] []
MIT
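W4A8KV4 in the title denotes the serving precision: 4-bit weights, 8-bit activations, and a 4-bit KV cache. A minimal sketch of symmetric fake-quantization at those bit-widths follows, purely to illustrate the notation; it is not QServe's kernels or algorithm.

```python
import numpy as np

def fake_quant(x: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric per-tensor quantize-dequantize to `bits` signed integers."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    scale = np.abs(x).max() / qmax or 1.0   # guard against all-zero input
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w, a, kv = rng.normal(size=(4, 4)), rng.normal(size=4), rng.normal(size=(2, 4))
# W4A8KV4: weights at 4 bits, activations at 8 bits, KV cache at 4 bits.
w4, a8, kv4 = fake_quant(w, 4), fake_quant(a, 8), fake_quant(kv, 4)
print(np.abs(w - w4).max(), np.abs(a - a8).max())  # 4-bit error >> 8-bit error
```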
ThunderServe: High-performance and Cost-efficient LLM Serving in Cloud Environments [] []
UCambridge & PKU & ETH
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking [] []
Qualcomm AI Research
Context Parallelism for Scalable Million-Token Inference [] []
Meta
MEADOW: Memory-efficient Dataflow and Data Packing for Low Power Edge LLMs [] []
Yale & IIT Roorkee & IBM Research
Attention Mechanisms
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving [] [] [] []
UW & NVIDIA
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [] [] [] []
MIT & NVIDIA
FastTree: Optimizing Attention Kernel and Runtime for Tree-Structured LLM Inference []
UCSD & AWS
FlexAttention: A Programming Model for Generating Fused Attention Variants [] []
Meta
LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [] []
Microsoft
TurboAttention: Efficient Attention Approximation for High Throughputs LLMs [] []
Microsoft & GaTech
SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention [] []
PKU & CUHK & Zhipu AI & THU & Shanghai AI Lab
RLHF Training
ReaL: Efficient RLHF Training of Large Language Models with Parameter Reallocation [] [] []
THU
MoE Inference
COMET: Fine-grained Computation-communication Overlapping for Mixture-of-Experts [] []
ByteDance Seed & SJTU
MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators [] [ (incoming)]
UIUC
LoRA Fine-tuning
HyC-LoRA: Memory Efficient LoRA Fine-tuning with Hybrid Activation Compression []
LLM Distillation
Self-Data Distillation for Recovering Quality in Pruned Large Language Models [] []
Cerebras Systems
LLM Agent Simulation
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution [] []
Stanford & GaTech
LLM for Relational Data Analytics
Optimizing LLM Queries in Relational Data Analytics Workloads [] []
UC Berkeley
Video Generation
ScaleFusion: Scalable Inference of Spatial-Temporal Diffusion Transformers for High-Resolution Long Video Generation []
Image Generation
DiffServe: Efficiently Serving Text-to-Image Diffusion Models with Query-Aware Model Scaling [] []
UMass Amherst & Adobe Research
Constructs model cascades → easy queries can be processed by more lightweight diffusion models (see the sketch below)
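A minimal sketch of such a query-aware cascade: try the cheap model first, escalate only when a quality check fails. The tier names, costs, and the `passes_quality` discriminator are illustrative assumptions, not DiffServe's actual design.

```python
# Toy diffusion-model cascade: route a query to the cheapest model that is
# expected to meet quality; escalate to the big model only when it fails.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost: float                      # relative GPU cost per image
    generate: Callable[[str], str]   # prompt -> image (stubbed as a string)

def make_stub(name: str) -> Callable[[str], str]:
    return lambda prompt: f"<image from {name} for {prompt!r}>"

CASCADE = [
    Tier("distilled-2step", cost=1.0, generate=make_stub("distilled-2step")),
    Tier("base-25step",     cost=6.0, generate=make_stub("base-25step")),
]

def passes_quality(image: str, prompt: str) -> bool:
    # Stand-in discriminator: real systems use a learned quality estimator.
    return len(prompt.split()) <= 8   # pretend short prompts are "easy"

def serve(prompt: str) -> str:
    for tier in CASCADE[:-1]:
        image = tier.generate(prompt)
        if passes_quality(image, prompt):
            return image                  # easy query: cheap model suffices
    return CASCADE[-1].generate(prompt)   # hard query: fall back to big model

print(serve("a cat"))   # short prompt → served by the lightweight model
print(serve("an intricate baroque palace interior with many chandeliers"))
```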
Scheduling
LAVA: Lifetime-Aware VM Allocation with Learned Distributions and Adaptation to Mispredictions [] []
Morphling: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling [] []
ECNU & Alibaba & HUST
Virtual CPU Oversubscription
ProtoRAIL: A Risk-cognizant Imitation Agent for Adaptive vCPU Oversubscription In the Cloud []
Microsoft
AIOps
AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds [] []
Microsoft
Compute-Communication Overlapping
TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives []
Volumetric Streaming
VoLUT: Efficient Volumetric streaming enhanced by LUT-based super-resolution [] []
UW-Madison & USC & MSRA
Data Processing
AdaParse: An Adaptive Parallel PDF Parsing and Resource Scaling Engine [] []
Acronyms
RLHF: Reinforcement Learning from Human Feedback
MoE: Mixture-of-Experts
LoRA: Low-Rank Adaptation
LUT: Lookup Table