SOSP 2025
Meta Info
Homepage: https://sigops.org/s/conferences/sosp/2025/
Acceptance rate: 17.7% (= 65 / 368)
Papers
Large Language Models (LLMs)
LLM Training
Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training
CUHK & ByteDance
LLM Inference
Effective Memory Management for Serving LLM with Heterogeneity [arXiv]
THU & Chicago & UC Berkeley
Two challenges
Recent models use token embeddings of heterogeneous sizes.
Some new architectures use only a subset of the prefix tokens to generate the next token.
Designs
Two-level memory allocator: choose the page size as the least common multiple (LCM) of the token embedding sizes (see the sketch after this list).
Enable attention variants to customize this mechanism by specifying exactly which prefix subset is used for attention.
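A minimal sketch of the page-size idea, assuming Python; the class and method names, embedding sizes, and free-list design are made up for illustration, not the paper's API. Choosing the page size as the LCM guarantees every page divides evenly into slots for any supported embedding size:

```python
import math
from functools import reduce

def page_size_bytes(embedding_sizes):
    """Pick the page size as the LCM of all token-embedding sizes, so a
    page always splits into an integral number of slots for whichever
    embedding size a request uses."""
    return reduce(math.lcm, embedding_sizes)

class TwoLevelAllocator:
    """Hypothetical sketch: level 1 hands out fixed-size pages from a
    free list; level 2 carves a page into slots of one embedding size."""
    def __init__(self, embedding_sizes, num_pages):
        self.page_size = page_size_bytes(embedding_sizes)
        self.free_pages = list(range(num_pages))

    def alloc_page(self, embedding_size):
        page_id = self.free_pages.pop()
        slots = self.page_size // embedding_size  # exact by construction
        return page_id, slots

alloc = TwoLevelAllocator(embedding_sizes=[576, 1024], num_pages=4096)
print(alloc.page_size)           # 9216 = lcm(576, 1024)
print(alloc.alloc_page(576)[1])  # 16 slots per page
print(alloc.alloc_page(1024)[1]) # 9 slots per page
```

With sizes 576 and 1024, the page size is lcm(576, 1024) = 9216 bytes, so a page holds exactly 16 slots of 576 bytes or 9 slots of 1024 bytes, with no internal fragmentation from mixing embedding sizes.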
LMPrefill: An Inference Engine for Prefill-only Workloads in Large Language Model Applications [arXiv]
Chicago & THU & LinkedIn & UC Berkeley
Hybrid prefilling: Prefill non-attention layers chunk-by-chunk, but prefill the attention layers normally.
Suffix KV cache discarding / offloading: Discard or offload the KV cache of suffix tokens once it is no longer needed, since prefill-only requests never run a decode phase.
Continuous JCT calibration: Continuously re-estimate the job completion time (JCT) of each pending request based on which requests have already been scheduled, then schedule only the single request with the lowest JCT (see the sketch below).
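A minimal sketch of continuous JCT calibration, assuming Python; `estimate_jct`, the throughput constant, and the request fields are all hypothetical stand-ins, not the paper's model. The point is that JCT estimates are recomputed against the already-committed schedule before each single-request admission:

```python
TOKENS_PER_SEC = 10_000  # made-up prefill throughput, for illustration only

def estimate_jct(req, scheduled):
    """Toy JCT model (assumption): completion time is the prefill work
    already scheduled ahead of this request plus its own prefill cost."""
    queued_work = sum(r["prefill_tokens"] for r in scheduled)
    return (queued_work + req["prefill_tokens"]) / TOKENS_PER_SEC

def schedule_loop(pending):
    """Re-calibrate every pending request's JCT against the requests
    already committed, then admit exactly one request per iteration:
    the one with the lowest estimated JCT."""
    scheduled = []
    while pending:
        best = min(pending, key=lambda r: estimate_jct(r, scheduled))
        pending.remove(best)
        scheduled.append(best)
    return scheduled

reqs = [{"id": 0, "prefill_tokens": 8000},
        {"id": 1, "prefill_tokens": 500},
        {"id": 2, "prefill_tokens": 2000}]
print([r["id"] for r in schedule_loop(reqs)])  # shortest-first: [1, 2, 0]
```

Under this toy estimator the policy degenerates to shortest-job-first; a richer estimator (e.g., one accounting for batching effects) would make the per-step recalibration matter more.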
Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market
PKU & Alibaba Cloud
Optimization
Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling
UCSD & Meta
GPU Checkpointing
PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation [arXiv]
SJTU IPADS