ATC 2025

Meta Info

Homepage: https://www.usenix.org/conference/atc25

Paper list: https://www.usenix.org/conference/atc25/technical-sessions

Acceptance Rate

15.8% (= 100 / 634)

Papers

Large Language Models (LLMs)

  • LLM Training

    • GreyHound: Hunting Fail-Slows in Hybrid-Parallel Training at Scale [Paper] [Video] [Code]

      • HKUST & Alibaba

      • Takeaways of the characterization study.

        • Fail-slows are usually transient, primarily caused by degradation in computation (slow GPUs or CPU contention) and communication (network congestion).

        • Computation fail-slows tend to be short-lived and less frequent; communication fail-slows due to network congestion are more common and tend to last longer.

        • As training scales up, the likelihood of simultaneously encountering multiple performance issues increases.

      • GreyHound-Detect

        • Use the Bayesian online change-point detection (BOCD) algorithm and a verification check to differentiate between real fail-slow issues and normal performance jitters.

        • Change-point verification: compare the average iteration time before and after each identified change-point, treating it as a jitter if the difference is less than 10% (see the sketch at the end of this entry).

      • GreyHound-Mitigate

        • Ski-rental-like multi-level straggler mitigation → Begin with a low-cost strategy and progressively switch to more effective and more costly strategies if fail-slow persists and the current approach proves ineffective.

        • Adjust the number of micro-batches allocated to DP groups according to their computation performance.

        • Adjust the parallelism topology to reduce congestion and minimize PP stages affected by stragglers.
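
      A minimal sketch of the change-point verification rule above, assuming per-iteration times are already collected; the 50-iteration comparison window is a placeholder, and this is not GreyHound's actual code.

      ```python
      # Sketch of GreyHound-style change-point verification (assumed interface).
      # A candidate change-point from BOCD is kept only if the mean iteration
      # time differs by at least 10% across the change-point; otherwise it is
      # treated as normal performance jitter.
      from statistics import mean

      JITTER_THRESHOLD = 0.10  # <10% difference -> jitter, not a fail-slow

      def verify_change_point(iter_times, cp_index, window=50):
          """Return True if the candidate change-point looks like a real fail-slow."""
          before = iter_times[max(0, cp_index - window):cp_index]
          after = iter_times[cp_index:cp_index + window]
          if not before or not after:
              return False  # not enough samples to judge
          rel_diff = abs(mean(after) - mean(before)) / mean(before)
          return rel_diff >= JITTER_THRESHOLD
      ```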

    • CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training [Paper] [Video] [Slides] [Code]

      • ETH

      • Present a latency- and bandwidth-aware performance model for the cross-DC environment (an illustrative cost model is sketched at the end of this entry).

      • PP is better than DP in cross-DC training.

      • Decouple block scheduling from communication arrangement.
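
      A rough alpha-beta illustration of a latency- and bandwidth-aware communication cost (my own sketch with assumed link parameters, not CrossPipe's formulation), just to show why PP's per-micro-batch activation traffic tolerates a cross-DC link better than DP's gradient all-reduce.

      ```python
      # Illustrative alpha-beta communication cost: cross-DC links add high
      # latency (alpha) and offer far less bandwidth (beta) than intra-DC links.
      def comm_time(msg_bytes, latency_s, bandwidth_bytes_per_s):
          return latency_s + msg_bytes / bandwidth_bytes_per_s

      activation = 64 * 2**20   # one PP boundary activation (example size)
      intra_dc = comm_time(activation, latency_s=10e-6, bandwidth_bytes_per_s=400e9 / 8)
      cross_dc = comm_time(activation, latency_s=30e-3, bandwidth_bytes_per_s=10e9 / 8)
      # PP sends only boundary activations per micro-batch across the slow link,
      # while DP would all-reduce full gradients over it every step.
      ```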

    • Obscura: Concealing Recomputation Overhead in Training of Large Language Models with Bubble-filling Pipeline Transformation [Paper] [Slides] [Video]

      • SYSU

      • Leverage pipeline transformation to better conceal recomputation overhead.

    • Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation [Paper] [Video]

      • Harvard & ByteDance & USC

      • Schedule the encoder computation within the LLM bubbles.

    • FlexPipe: Maximizing Training Efficiency for Transformer-based Models with Variable-Length Inputs [Paper] [Video]

      • Jilin University & UC Riverside

      • Dynamically adjust the PP configuration at runtime via a live flexibility mechanism.

  • LLM Inference

    • DeepServe: Serverless Large Language Model Serving at Scale [Paper] [Video]

      • PKU & Huawei Cloud

    • Weaver: Efficient Multi-LLM Serving with Attention Offloading [Paper] [Video] [Slides]

      • THU

      • Opportunity: offload attention from hot to cold instances.

      • Challenge 1: The offloaded attention is blocked by many pre-issued kernels

        • Solution: GPU-driven control flow

        • Pipeline: The sender (hot model) writes QKV results in GPU shared memory & updates task counter → The receiver (cold model) executes a polling kernel to select a task → The receiver writes the output in GPU shared memory & updates completion counter → The sender waits for the output.

      • Challenge 2: The offloaded task is blocked by a single long-running kernel.

        • Solution: Operator splitting

        • Pipeline: Sort operators by running time → Split the longest operator into two halves and reinsert them into the queue → Repeat until the waiting time < threshold (see the sketch below).
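
      A rough sketch of the operator-splitting loop, assuming the waiting time is approximated by the duration of the longest pending kernel (my simplification, not Weaver's exact cost model).

      ```python
      # Sketch of Weaver-style operator splitting: keep halving the
      # longest-running kernel until an offloaded attention task can never wait
      # behind a single kernel for longer than the threshold.
      import heapq

      def split_until_bounded(op_times, wait_threshold):
          """op_times: per-kernel running times (seconds); wait_threshold > 0."""
          heap = [-t for t in op_times]              # max-heap via negated values
          heapq.heapify(heap)
          while heap and -heap[0] >= wait_threshold:
              longest = -heapq.heappop(heap)         # biggest operator
              heapq.heappush(heap, -(longest / 2))   # split into two halves
              heapq.heappush(heap, -(longest / 2))   # and reinsert both
          return sorted((-t for t in heap), reverse=True)
      ```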

    • Toppings: CPU-Assisted, Rank-Aware Adapter Serving for LLM Inference [Paper]

      • HKUST & CUHK-SZ & TeleAI & Huawei Cloud

      • A system to serve many LoRA adapters derived from a common base model.

      • Pin the base model on GPUs and dynamically load the requested LoRA adapters from host memory as new requests arrive.

      • Use CPUs to compute the lightweight LoRA adaptation during prefill & switch to the GPUs once adapter loading completes to resume the remaining computation (see the sketch below).

      • Schedule heterogeneous LoRA requests to maximize the SLO attainment.
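
      A minimal sketch of the CPU-assisted LoRA path using PyTorch (hypothetical function, not Toppings' actual code): the frozen base projection always runs on the GPU, while the low-rank delta runs on the CPU until the adapter finishes loading onto the GPU.

      ```python
      # Sketch of CPU-assisted adapter execution (assumed data layout):
      # W is the pinned base weight on the GPU; (A, B) are the LoRA factors.
      import torch

      def lora_linear(x_gpu, W_gpu, adapter_host, adapter_gpu=None):
          """adapter_host/adapter_gpu: (A, B) tuples; adapter_gpu is None while loading."""
          base = x_gpu @ W_gpu.T                         # frozen base path, always on GPU
          if adapter_gpu is not None:                    # loading finished: GPU fast path
              A, B = adapter_gpu
              delta = (x_gpu @ A.T) @ B.T
          else:                                          # still loading: compute delta on CPU
              A, B = adapter_host
              delta = ((x_gpu.cpu() @ A.T) @ B.T).to(x_gpu.device)
          return base + delta
      ```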

    • QFactory: Accelerating Quantized Large Language Model Serving with Qtile Graphs [Paper] [Video] [Artifact]

      • THU

      • A compilation framework to generate high-performance quantized kernels.

      • Transform the traditional tensor computation graph into a Qtile graph (QGraph).

      • Explore graph-level Qtile computation transformations to generate equivalent QGraphs.

      • Employ operator-level Qtile scheduling to identify optimal memory loading strategies for each Qtile within the QGraph before generating the final code.

    • CLONE: Customizing LLMs for Efficient Latency-Aware Inference at the Edge [Paper] [Video] [Slides]

      • Macau

      • Offline device-specific tailoring

        • LLM layers contribute unevenly to effectiveness and efficiency → Fine-grained layer-wise tuning.

      • Online latency-aware inference

        • Request-wise MoE-based router → Dynamically merge LoRA modules for each mixed-task prompt.

        • Learning-based DVFS (Dynamic Voltage and Frequency Scaling) controller → Reduce per-generated token energy consumption while satisfying the real-time latency target at the layer-wise level.

  • LLM Fine-Tuning

    • JENGA: Enhancing LLM Long-Context Fine-tuning with Contextual Token Sparsity [Paper] [Slides] [Video] [Artifact]

      • THU & MSRA

      • Exploit a new token-level sparsity mechanism inherent in long-context scenarios.

    • mTuner: Accelerating Parameter-Efficient Fine-Tuning on Multi-GPU Servers with Elastic Tensor [Paper] [Video] [Code]

      • THU

      • Elastic Tensor, an abstraction for dynamic tensor management → Enable flexible control over their availability, accumulation, and release in memory.
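
      A toy sketch of what an "elastic tensor" style wrapper might look like (hypothetical names, not mTuner's API): the full tensor can be released under memory pressure and re-materialized on demand.

      ```python
      # Toy elastic-tensor wrapper (illustrative only): availability is controlled
      # by acquire()/release(); materialize_fn rebuilds the full tensor, e.g. by
      # all-gathering shards from other GPUs.
      class ElasticTensor:
          def __init__(self, materialize_fn):
              self._materialize = materialize_fn
              self._data = None

          def acquire(self):
              if self._data is None:
                  self._data = self._materialize()   # make the full tensor available
              return self._data

          def release(self):
              self._data = None                      # free the full copy under memory pressure
      ```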

  • Resource Multiplexing

    • Resource Multiplexing in Tuning and Serving Large Language Models [Paper] [Video] [Artifact]

      • ETH

      • LLMStation

        • A new iteration-level multitasking scheduling mechanism.

        • An Autograd engine to transform a tuning task into a suspendable pipeline.

        • An inference engine that batches inference and tuning requests together.

  • KV Cache Management

    • KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider [Paper] [Slides] [Video] [Trace]

      • SJTU IPADS & Alibaba Cloud

      • Key takeaways from the characterization study

        • KV$ reuses are common, but the reuse ratio is smaller than previously reported numbers on synthetic datasets.

        • For each specific request category, the reuse time is predictable based on the historical information.

        • The lifespan of KV$ is ephemeral.

  • SpMM

    • GeneralSparse: Bridging the Gap in SpMM for Pruned Large Language Model Inference on GPUs [Paper] [Slides] [Video] [Code]

      • ICT, CAS

      • Generate the SpMM program based on the pruned weight and the batch size of the dense matrix.

    • Voltrix: Sparse Matrix-Matrix Multiplication on Tensor Cores with Asynchronous and Balanced Kernel Optimization [Paper] [Video]

      • WHU & NVIDIA & Macau

Mixture-of-Experts (MoE)

  • MoE Training

    • PopFetcher: Towards Accelerated Mixture-of-Experts Training Via Popularity Based Expert-Wise Prefetch [Paper] [Slides] [Video]

      • HUST

      • Prefetch high-demand experts of the next layer during the execution of the current non-MoE computations (see the sketch below).

      • Prioritize All-to-All communication stream over All-Reduce operation among prefetched experts.
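
      A minimal sketch of popularity-based prefetch planning; the popularity signal (recent routing decisions) and the top-k policy are assumptions, not PopFetcher's actual predictor.

      ```python
      # Sketch of popularity-based expert prefetching: rank next-layer experts by
      # how often they were selected recently and prefetch the hottest ones while
      # the current non-MoE computation is running.
      from collections import Counter

      def plan_prefetch(recent_expert_ids, num_prefetch=2):
          """recent_expert_ids: expert ids chosen by tokens in recent iterations."""
          popularity = Counter(recent_expert_ids)
          return [expert for expert, _ in popularity.most_common(num_prefetch)]
      ```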

Diffusion Models

  • Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters [Paper] [Video] [Slides] [Code] [Trace]

    • HKUST & Alibaba

    • Our work!

    • ControlNet-as-a-Service → Enable ControlNet caching, parallelization, and sharing.

    • Bounded Asynchronous Loading (BAL) → Overlap LoRA loading with the initial base model execution by at most K steps (sketched below).

    • Latent parallelism → Accelerate base model execution across multiple GPUs.
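
    A minimal sketch of the BAL idea with hypothetical load_lora/attach_lora/denoise_step interfaces; the real Katz pipeline is more involved.

    ```python
    # Sketch of Bounded Asynchronous Loading: start copying the LoRA weights in
    # the background and run at most K denoising steps of the unadapted base
    # model before blocking on the load.
    from concurrent.futures import ThreadPoolExecutor

    def generate(base_model, load_lora, latents, num_steps, K=4):
        with ThreadPoolExecutor(max_workers=1) as pool:
            lora_future = pool.submit(load_lora)                  # async host -> GPU copy
            attached = False
            for step in range(num_steps):
                if not attached and (lora_future.done() or step >= K):
                    base_model.attach_lora(lora_future.result())  # blocks if still loading
                    attached = True
                latents = base_model.denoise_step(latents, step)  # hypothetical API
        return latents
    ```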

Deep Learning Recommendation Models (DLRMs)

  • DLRM Training

    • Primus: Unified Training System for Large-Scale Deep Learning Recommendation Models [Paper] [Slides]

      • ByteDance

      • Unified resource scheduling

        • Unified API-server upon a diverse cluster.

        • Provide both dynamic horizontal and vertical scaling mechanisms.

        • Standardize YARN & Kubernetes scheduling semantics.

      • Unified data orchestration

        • Support batch and stream data mixture with a three-tier data definition (Dataset, Data Stream, Data Source).

        • Provide a graph-based task planner to accelerate training task generation.

      • Unified training paradigm

        • Mixture Training Recommendation Model (MTRM), a new model with memory and adaptive towers to handle catastrophic forgetting and delayed feedback.

Deep Learning Compilation

  • PluS: Highly Efficient and Expandable ML Compiler with Pluggable Graph Schedules [Paper] [Video]

    • RUC & Microsoft & THU

GPU Sharing

  • Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception [Paper] [Video]

    • SJTU & Lenovo

    • Krypton

    • Intercept GPU command buffers at the kernel level to provide virtual GPU devices.

    • The hardware units are divided using MIG, while time slices and device memory are allocated using the kernel-space scheduler.

  • GPreempt: GPU Preemptive Scheduling Made General and Efficient [Paper] [Slides] [Video] [Code]

    • THU

    • Implement a timeslice-based yield mechanism to enable context-switch preemption on GPUs.

    • Employ a hint-based pre-preemption technique to overlap the preemption process with the essential data-preparation phase.

  • Colocating ML Inference and Training with Fast GPU Memory Handover [Paper] [Slides] [Video] [Code]

    • SJTU IPADS

    • Key insight: training task is elastic and reconfigurable; transfer memory between training and inference by reconfiguring training tasks (i.e., changing batch size).
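
    A toy calculation of the handover, assuming activation memory scales linearly with batch size (my simplification, not the paper's memory model).

    ```python
    # Toy batch-size reconfiguration: shrink the training batch just enough that
    # the freed activation memory covers the inference burst, instead of evicting
    # the training job.
    def shrink_batch(cur_batch, fixed_bytes, per_sample_bytes, gpu_bytes, demand_bytes):
        budget = gpu_bytes - fixed_bytes - demand_bytes      # memory left for activations
        new_batch = int(budget // per_sample_bytes)
        return max(1, min(cur_batch, new_batch))             # keep training alive at batch >= 1
    ```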

Cloud Computing

  • Serverless Computing

    • Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference [Paper] [Video] [Code]

      • CUHK-SZ & HKUST & Alibaba & Nokia Bell Labs

      • Maintain models in main memory and dynamically swap them onto GPUs upon request arrivals.

      • Several techniques to minimize latency overhead caused by model swapping: Asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management.
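
      A minimal sketch of the pipelined-model-execution idea with hypothetical fetch/run helpers (not Torpor's API): swapping in layer i+1 is overlapped with executing layer i.

      ```python
      # Sketch of pipelined model execution during swap-in: prefetch the next
      # layer's weights from host memory while the current layer runs on the GPU.
      def pipelined_forward(layers, x, fetch_to_gpu, run_on_gpu):
          pending = fetch_to_gpu(layers[0])                 # async copy, returns a handle
          for i, layer in enumerate(layers):
              weights = pending.wait()                      # wait until layer i is resident
              if i + 1 < len(layers):
                  pending = fetch_to_gpu(layers[i + 1])     # overlap next copy with compute
              x = run_on_gpu(layer, weights, x)
          return x
      ```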

    • Burst Computing: Quick, Sudden, Massively Parallel Processing on Serverless Resources [Paper] [Slides] [Video] [Code]

      • Universitat Rovira i Virgili & Barcelona Supercomputing Center

      • Key principle: group awareness.

  • Image Provisioning

    • Poby: SmartNIC-accelerated Image Provisioning for Coldstart in Clouds [Paper] [Slides] [Video] [Artifact]

      • ICT, CAS

      • Disaggregated architecture

        • Orchestrate resources of the entire cluster to accelerate image provisioning.

      • Pipeline-based data-driven workflow

        • Pipeline the workflow to enhance efficiency.

        • Eliminate the overhead of control messages.

      • Distributed image download

        • Image metadata index (IMI) records all images and node info.

        • Keep all IMIs in memory.

Data Preprocessing

  • HyCache: Hybrid Caching for Accelerating DNN Input Preprocessing Pipelines [Paper] [Video]

    • IISc & USC

    • Enable the caching of subsets of preprocessed data from multiple intermediate steps on both memory and storage.
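
    A toy two-tier cache for intermediate preprocessing outputs (illustrative only, not HyCache's design): recent items stay in memory and older ones spill to local storage so they need not be recomputed from scratch.

    ```python
    # Toy hybrid (memory + storage) cache; keys are assumed to be filesystem-safe
    # strings identifying a sample at a given preprocessing step.
    import os
    import pickle
    from collections import OrderedDict

    class HybridCache:
        def __init__(self, mem_items, spill_dir):
            self.mem = OrderedDict()
            self.mem_items = mem_items
            self.spill_dir = spill_dir
            os.makedirs(spill_dir, exist_ok=True)

        def put(self, key, value):
            self.mem[key] = value
            self.mem.move_to_end(key)
            if len(self.mem) > self.mem_items:             # evict oldest to the storage tier
                old_key, old_val = self.mem.popitem(last=False)
                with open(os.path.join(self.spill_dir, old_key), "wb") as f:
                    pickle.dump(old_val, f)

        def get(self, key):
            if key in self.mem:
                return self.mem[key]
            path = os.path.join(self.spill_dir, key)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return pickle.load(f)
            return None                                    # miss: recompute upstream steps
    ```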

POSIX Shell

  • The Koala Benchmarks for the Shell: Characterization and Implications [Paper] [Video] [Homepage] [Benchmark Suite]

    • Brown University

    • Best Paper Award

    • 14 sets of real-world shell programs from diverse domains ranging from CI/CD and AI/ML to biology and the humanities.

Acronyms

  • LoRA: Low-Rank Adaptation

  • MoE: Mixture-of-Experts

  • PP: Pipeline Parallelism

  • DP: Data Parallelism

  • SpMM: Sparse-dense Matrix Multiplication
