ATC 2025
Meta Info
Homepage: https://www.usenix.org/conference/atc25
Paper list: https://www.usenix.org/conference/atc25/technical-sessions
Acceptance Rate
15.8% (= 100 / 634)
Papers
Large Language Models (LLMs)
LLM Training
GreyHound: Hunting Fail-Slows in Hybrid-Parallel Training at Scale [Paper] [Video] [Code]
HKUST & Alibaba
Takeaways from the characterization study
Fail-slows are usually transient, primarily caused by degradation in computation (slow GPUs or CPU contention) and communication (network congestion).
Computation fail-slows tend to be short-lived and less frequent; communication fail-slows due to network congestion are more common and tend to last longer.
As training scales up, the likelihood of simultaneously encountering multiple performance issues increases.
GreyHound-Detect
Use the Bayesian online change-point detection (BOCD) algorithm and a verification check to differentiate between real fail-slow issues and normal performance jitters.
Change-point verification: compare the average iteration time before and after each identified change-point and treat it as jitter if the difference is less than 10% (see the sketch below).
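A minimal Python sketch of this verification rule, assuming a fixed comparison window; the 10% threshold comes from the description above, while the window size and function name are illustrative.

```python
from statistics import mean

def is_real_fail_slow(iter_times, cp_index, window=50, threshold=0.10):
    """Return True if the candidate change-point at cp_index looks like a
    genuine fail-slow rather than normal performance jitter."""
    before = iter_times[max(0, cp_index - window):cp_index]
    after = iter_times[cp_index:cp_index + window]
    if not before or not after:
        return False                                   # not enough samples to verify
    avg_before, avg_after = mean(before), mean(after)
    # Treat the change-point as jitter if the average iteration time
    # changes by less than 10%.
    return abs(avg_after - avg_before) / avg_before >= threshold
```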
GreyHound-Mitigate
Ski-rental-like multi-level straggler mitigation → Begin with a low-cost strategy and progressively switch to more effective but costlier strategies if the fail-slow persists and the current approach proves ineffective (see the sketch after this list).
Adjust the number of micro-batches allocated to DP groups according to their computation performance.
Adjust the parallelism topology to reduce congestion and minimize PP stages affected by stragglers.
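A hedged sketch of the escalation loop, assuming mitigation strategies are ordered by cost; the strategy names below are illustrative placeholders for the two mechanisms listed above, not GreyHound's actual interfaces.

```python
def escalate_mitigation(strategies, still_slow):
    """strategies: zero-argument callables ordered from cheapest to most costly;
    still_slow: callable returning True while the fail-slow persists."""
    for mitigate in strategies:
        mitigate()                        # try the cheapest remaining level
        if not still_slow():
            return mitigate.__name__      # the cheaper remedy was enough; stop here
    return None                           # all levels exhausted, fail-slow persists

# Example ordering that mirrors the two mechanisms above (placeholders only).
def rebalance_micro_batches():            # low cost: shift micro-batches across DP groups
    pass

def adjust_parallel_topology():           # higher cost: reroute to reduce congestion
    pass

levels = [rebalance_micro_batches, adjust_parallel_topology]
```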
CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training [Paper] [Video] [Slides] [Code]
ETH
Present a latency and bandwidth-aware performance model designed for the cross-DC environment.
Placing PP rather than DP across the inter-DC link is preferable: pipeline communication is less sensitive to the high latency and limited bandwidth of cross-DC links than DP gradient synchronization.
Decouple block scheduling from communication arrangement.
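A rough alpha-beta-style sketch of the kind of latency- and bandwidth-aware comparison such a model enables; the formulas and traffic assumptions below are illustrative, not CrossPipe's actual performance model.

```python
def link_time(num_bytes, latency_s, bandwidth_Bps):
    """Alpha-beta model: one transfer costs latency + size / bandwidth."""
    return latency_s + num_bytes / bandwidth_Bps

def pp_cross_dc_time(activation_bytes, num_micro_batches, latency_s, bw):
    # PP sends one activation (forward) and one gradient (backward)
    # per micro-batch across the inter-DC cut.
    return 2 * num_micro_batches * link_time(activation_bytes, latency_s, bw)

def dp_cross_dc_time(gradient_bytes, latency_s, bw):
    # DP synchronizes the full gradient across the cut every iteration;
    # a ring all-reduce moves roughly 2x the gradient volume.
    return link_time(2 * gradient_bytes, latency_s, bw)
```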
LLM Inference
Weaver: Efficient Multi-LLM Serving with Attention Offloading [Paper] [Video] [Slides]
THU
Opportunity: offload attention from hot to cold instances.
Challenge 1: The offloaded attention is blocked by many pre-issued kernels.
Solution: GPU-driven control flow
Pipeline: The sender (hot model) writes QKV results in GPU shared memory & updates task counter → The receiver (cold model) executes a polling kernel to select a task → The receiver writes the output in GPU shared memory & updates completion counter → The sender waits for the output.
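A host-side Python analogy of this handshake, with the counters and buffer modeled as plain object fields; in Weaver the sender and receiver are GPU kernels polling counters in GPU memory, so everything below is an illustrative sketch of the protocol only.

```python
import threading

class OffloadSlot:
    """Shared buffer plus counters standing in for the GPU-resident slot."""
    def __init__(self):
        self.task_counter = 0        # bumped by the sender when QKV is ready
        self.completion_counter = 0  # bumped by the receiver when output is ready
        self.qkv = None
        self.output = None

def sender(slot, qkv):
    slot.qkv = qkv                   # hot instance writes QKV into the shared slot
    slot.task_counter += 1           # publish the task
    while slot.completion_counter < slot.task_counter:
        pass                         # wait for the offloaded attention output
    return slot.output

def receiver(slot, num_tasks):
    seen = 0
    while seen < num_tasks:          # persistent polling loop ("polling kernel")
        if slot.task_counter > seen:
            seen = slot.task_counter
            slot.output = ("attn", slot.qkv)   # placeholder for the attention compute
            slot.completion_counter += 1       # signal completion back to the sender

slot = OffloadSlot()
threading.Thread(target=receiver, args=(slot, 1), daemon=True).start()
print(sender(slot, qkv="dummy-qkv"))
```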
Challenge 2: The offloaded task is blocked by a single long-running kernel.
Solution: Operator splitting
Pipeline: Sort operators by running time → Split the longest operator into two halves → Reinsert the halves into the queue & repeat until the waiting time < threshold.
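A minimal sketch of that splitting loop using a max-heap of kernel running times; representing operators as plain durations is an assumption for illustration.

```python
import heapq

def split_long_operators(op_times, threshold):
    """op_times: per-kernel running times; split the longest kernel in half and
    reinsert until no single pending kernel exceeds `threshold`."""
    heap = [-t for t in op_times]            # max-heap via negated times
    heapq.heapify(heap)
    while heap and -heap[0] > threshold:
        longest = -heapq.heappop(heap)       # pick the biggest operator
        heapq.heappush(heap, -longest / 2)
        heapq.heappush(heap, -longest / 2)   # reinsert both halves and repeat
    return sorted((-t for t in heap), reverse=True)
```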
Toppings: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference [Paper]
HKUST & CUHK-SZ & TeleAI & Huawei Cloud
A system to serve many LoRA adapters derived from a common base model.
Pin the base model on GPUs and dynamically load the requested LoRA adapters from host memory as new requests arrive.
Use CPUs to compute the lightweight LoRA adaptation during prefill & switch to the GPUs once adapter loading completes to resume the remaining computation.
Schedule heterogeneous LoRA requests to maximize the SLO attainment.
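A minimal sketch of the CPU-assisted adapter path for a single linear layer, assuming simplified tensor shapes and a boolean flag for whether the adapter has finished loading onto the GPU; the actual switch-over and scheduling logic in Toppings is more involved.

```python
import torch

def lora_linear(x, base_weight, lora_A, lora_B, adapter_on_gpu):
    """x: [batch, d_in] on GPU; base_weight: [d_out, d_in] pinned on GPU;
    lora_A: [r, d_in] and lora_B: [d_out, r] live on the GPU if adapter_on_gpu,
    otherwise they are still in host memory."""
    base_out = x @ base_weight.t()                  # base model always runs on GPU
    if adapter_on_gpu:
        delta = (x @ lora_A.t()) @ lora_B.t()       # adapter already resident on GPU
    else:
        x_cpu = x.to("cpu")                          # adapter still loading:
        delta = ((x_cpu @ lora_A.t()) @ lora_B.t()).to(x.device)  # low-rank path on CPU
    return base_out + delta
```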
QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs [Paper] [Video] [Artifact]
THU
A compilation framework to generate high-performance quantized kernels.
Transform the traditional tensor computation graph into a Qtile graph (QGraph).
Explore graph-level Qtile computation transformations to generate equivalent QGraphs.
Employ operator-level Qtile scheduling to identify optimal memory loading strategies for each Qtile within the QGraph before generating the final code.
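An illustrative sketch of treating a quantized tile (Qtile) as the unit of computation, with a per-tile scale and tile-level dequantization; this is a conceptual example only, not QFactory's QGraph IR or generated code.

```python
import numpy as np

def qtile_matmul(q_tile, scale, dense_tile):
    """q_tile: int8 [tm, tk] quantized weights with one scale per tile;
    dense_tile: float32 [tk, tn] activations."""
    # Dequantize at Qtile granularity; a schedule decides when and where
    # this dequantization and the memory loads happen.
    deq = q_tile.astype(np.float32) * scale
    return deq @ dense_tile                   # tile-level GEMM
```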
CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge [Paper] [Video] [Slides]
Macau
Offline device-specific tailoring
LLM layers contribute unevenly to effectiveness and efficiency → Fine-grained layer-wise tuning.
Online latency-aware inference
Request-wise MoE-based router → Dynamically merge LoRA modules for each mixed-task prompt.
Learning-based DVFS (Dynamic Voltage and Frequency Scaling) controller → Reduce per-generated token energy consumption while satisfying the real-time latency target at the layer-wise level.
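A hedged sketch of the request-wise LoRA merging step, assuming the router outputs one logit per task-specific LoRA module; the module layout and the merge-into-one-delta formulation are illustrative assumptions.

```python
import torch

def merge_loras(prompt_embedding, lora_modules, router):
    """lora_modules: list of (A, B) pairs with A: [r, d_in] and B: [d_out, r];
    router: callable mapping a prompt embedding to one logit per module."""
    weights = torch.softmax(router(prompt_embedding), dim=-1)   # per-request mixture
    merged_delta = sum(w * (B @ A) for w, (A, B) in zip(weights, lora_modules))
    return merged_delta              # added onto the base weight for this request
```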
LLM Fine-Tuning
Resource Multiplexing
KV Cache Management
KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider [Paper] [Slides] [Video] [Trace]
SJTU IPADS & Alibaba Cloud
Key takeaways from the characterization study
KV$ reuse is common, but the reuse ratio is smaller than the numbers previously reported on synthetic datasets.
For each request category, the reuse time is predictable from historical information.
The lifespan of KV$ is ephemeral.
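A minimal sketch of exploiting the per-category predictability, assuming a simple moving average of historical reuse intervals; the data structure and policy are illustrative, not the paper's system design.

```python
from collections import defaultdict, deque

class ReusePredictor:
    """Track reuse intervals per request category and predict the next reuse time,
    e.g., to decide how long a KV cache entry is worth keeping."""
    def __init__(self, history=64):
        self.intervals = defaultdict(lambda: deque(maxlen=history))

    def record(self, category, interval_s):
        self.intervals[category].append(interval_s)      # observed time until reuse

    def expected_reuse(self, category, default_s=60.0):
        hist = self.intervals[category]
        return sum(hist) / len(hist) if hist else default_s
```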
SpMM
Mixture-of-Experts (MoE)
MoE Training
PopFetcher: Towards Accelerated Mixture-of-Experts Training Via Popularity Based Expert-Wise Prefetch [Paper] [Slides] [Video]
HUST
Prefetch high-demand experts of the next MoE layer during the current non-MoE computation.
Prioritize the All-to-All communication stream over All-Reduce operations among prefetched experts.
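A hedged sketch of the prefetch decision, assuming expert popularity is estimated from recent routing counts and that an asynchronous fetch primitive exists (fetch_async is a hypothetical helper, not PopFetcher's API).

```python
def prefetch_popular_experts(routing_counts, next_layer, k, fetch_async):
    """routing_counts: {expert_id: tokens recently routed to it} for next_layer."""
    popular = sorted(routing_counts, key=routing_counts.get, reverse=True)[:k]
    handles = [fetch_async(next_layer, expert_id) for expert_id in popular]
    return handles   # overlaps with the current non-MoE compute; awaited before the MoE layer
```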
Diffusion Models
Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters [Paper] [Video] [Slides] [Code] [Trace]
HKUST & Alibaba
Our work!
ControlNet-as-a-Service → Enable ControlNet caching, parallelization, and sharing.
Bounded Asynchronous Loading (BAL) → Overlap LoRA loading with initial base model execution by a maximum of K steps.
Latent parallelism → Accelerate base model execution across multiple GPUs.
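A minimal sketch of Bounded Asynchronous Loading under the assumption that the LoRA load is exposed as a future; the names and step loop are illustrative, but the K-step bound mirrors the description above.

```python
def generate_with_bal(base_step, total_steps, lora_load_future, K, apply_lora):
    """Run the base model's denoising steps while the LoRA loads in the background,
    but never run more than K steps ahead of the load."""
    lora_applied = False
    for step in range(total_steps):
        if not lora_applied and (lora_load_future.done() or step >= K):
            lora_load_future.result()   # block only once the K-step bound is reached
            apply_lora()                # attach/merge the loaded LoRA weights
            lora_applied = True
        base_step(step)                 # one step of the (possibly LoRA-ed) model
```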
Deep Learning Recommendation Models (DLRMs)
DLRM Training
Primus: Unified Training System for Large-Scale Deep Learning Recommendation Models [Paper] [Slides]
ByteDance
Unified resource scheduling
A unified API server on top of diverse clusters.
Provide both dynamic horizontal and vertical scaling mechanisms.
Standardize YARN & Kubernetes scheduling semantics.
Unified data orchestration
Support batch and stream data mixture with a three-tier data definition (Dataset, Data Stream, Data Source); see the sketch at the end of this entry.
Provide a graph-based task planner to accelerate training task generation.
Unified training paradigm
Mixture Training Recommendation Model (MTRM), a new model with memory and adaptive towers to handle catastrophic forgetting and delayed feedback.
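A minimal dataclass sketch of the three-tier data definition mentioned above; only the Dataset / Data Stream / Data Source hierarchy comes from the paper, while all field names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataSource:
    """Lowest tier: one concrete batch or stream source (e.g., a table path or a topic)."""
    uri: str
    kind: str                      # "batch" or "stream"

@dataclass
class DataStream:
    """Middle tier: an ordered mixture of sources consumed as one logical stream."""
    sources: List[DataSource] = field(default_factory=list)

@dataclass
class Dataset:
    """Top tier: the training-facing dataset composed of one or more data streams."""
    name: str
    streams: List[DataStream] = field(default_factory=list)
```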
Deep Learning Compilation
GPU Sharing
Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception [Paper] [Video]
SJTU & Lenovo
Krypton
Intercept GPU command buffers at the kernel level to provide virtual GPU devices.
Hardware units are partitioned using MIG, while time slices and device memory are allocated by the kernel-space scheduler.
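A toy round-robin time-slicing sketch to illustrate the scheduling side only; Krypton's real scheduler runs in kernel space and also enforces memory limits and MIG partitions, none of which is modeled here.

```python
from collections import deque

def round_robin_schedule(vgpus, total_time_ms, slice_ms=5):
    """Grant GPU time slices to virtual GPUs in round-robin order and return
    the accumulated share per vGPU."""
    queue = deque(vgpus)
    usage = {v: 0 for v in vgpus}
    elapsed = 0
    while elapsed < total_time_ms:
        vgpu = queue.popleft()
        usage[vgpu] += slice_ms     # this vGPU gets one slice
        queue.append(vgpu)          # rotate to the next vGPU
        elapsed += slice_ms
    return usage
```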
GPreempt: GPU Preemptive Scheduling Made General and Efficient [Paper] [Slides] [Video] [Code]
THU
Implement a timeslice-based yield mechanism to enable context-switch preemption on GPUs.
Employ a hint-based pre-preemption technique to overlap the preemption process with the essential data-preparation phase.
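A CPU-side analogy of the yield mechanism, assuming long work is chunked and a shared flag signals that a higher-priority task wants the device; the real mechanism lives in GPU kernels and the driver, so this only illustrates the control flow.

```python
import threading

preempt_flag = threading.Event()   # set by the scheduler when a high-priority task arrives

def long_running_task(chunks):
    """Process work in small chunks and check for preemption at each boundary,
    mimicking a timeslice-based yield point inside a long-running kernel."""
    done = 0
    for chunk in chunks:
        if preempt_flag.is_set():
            return ("yielded", done)   # save progress and give up the device
        done += chunk                  # placeholder for one timeslice of real work
    return ("finished", done)
```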
Cloud Computing
Serverless Computing
Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference [Paper] [Video] [Code]
CUHK-SZ & HKUST & Alibaba & Nokia Bell Labs
Maintain models in main memory and dynamically swap them onto GPUs upon request arrivals.
Several techniques to minimize latency overhead caused by model swapping: Asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management.
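A minimal sketch of the pipelined model execution idea, assuming per-layer load and execute callables; a loader thread fetches layer i+1 from host memory while layer i runs, so swapping hides behind computation.

```python
from concurrent.futures import ThreadPoolExecutor

def pipelined_inference(layers, load_to_gpu, run_on_gpu, x):
    """Overlap swapping layer i+1 onto the GPU with executing layer i."""
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_to_gpu, layers[0])              # start loading layer 0
        for i in range(len(layers)):
            gpu_layer = pending.result()                             # wait for layer i
            if i + 1 < len(layers):
                pending = loader.submit(load_to_gpu, layers[i + 1])  # prefetch layer i+1
            x = run_on_gpu(gpu_layer, x)                             # execute layer i meanwhile
    return x
```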
Image Provisioning
Poby: SmartNIC-accelerated Image Provisioning for Coldstart in Clouds [Paper] [Slides] [Video] [Artifact]
ICT, CAS
Disaggregated architecture
Orchestrate resources of the entire cluster to accelerate image provisioning.
Pipeline-based data-driven workflow
Pipeline the workflow to enhance efficiency.
Eliminate the overhead of control messages.
Distributed image download
Image metadata index (IMI) records all images and node info.
Keep all IMIs in memory.
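A minimal sketch of what an in-memory image metadata index could look like, mapping layer digests to nodes that already hold them so downloads can be served from peers; the class and method names are illustrative assumptions, not Poby's interface.

```python
from collections import defaultdict

class ImageMetadataIndex:
    """In-memory index from image layer digest to the nodes that cache it."""
    def __init__(self):
        self.holders = defaultdict(set)          # layer digest -> set of node IDs

    def register(self, node_id, layer_digests):
        for digest in layer_digests:
            self.holders[digest].add(node_id)    # node now caches these layers

    def pick_sources(self, layer_digests, fallback="registry"):
        # Prefer in-cluster peers; fall back to the remote registry otherwise.
        return {d: (next(iter(self.holders[d])) if self.holders[d] else fallback)
                for d in layer_digests}
```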
Data Preprocessing
POSIX Shell
The Koala Benchmarks for the Shell: Characterization and Implications [Paper] [Video] [Homepage] [Benchmark Suite]
Brown University
Best Paper Award
14 sets of real-world shell programs from diverse domains ranging from CI/CD and AI/ML to biology and the humanities.
Acronyms
LoRA: Low-Rank Adaptation
MoE: Mixture-of-Experts
PP: Pipeline Parallelism
DP: Data Parallelism
SpMM: Sparse-dense Matrix Multiplication