ATC 2024


Meta Info

Homepage: https://www.usenix.org/conference/atc24

Paper list: https://www.usenix.org/conference/atc24/technical-sessions

Acceptance Rate

15.8% (= 77 / 488)

Papers

Large Language Models (LLMs)

  • Serving LLMs

    • Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention

      • NUS & SJTU & Huawei Cloud

      • Reuse KV caches across multi-turn conversations; maintain a hierarchical KV caching system; layer-wise pre-loading and asynchronous saving; scheduler-aware fetching and eviction.
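
A minimal sketch of the caching idea above, with made-up tier sizes and a plain LRU policy standing in for the paper's scheduler-aware fetching and eviction:

```python
# Toy sketch (not the authors' code): keep each conversation's KV cache in a
# two-tier hierarchy and reuse it across turns, instead of recomputing
# prefill over the whole conversation history.
from collections import OrderedDict

class HierarchicalKVCache:
    def __init__(self, gpu_slots=4):
        self.gpu = OrderedDict()   # hot tier: conversation_id -> KV tensors
        self.host = {}             # cold tier: spilled KV caches (host DRAM/SSD)
        self.gpu_slots = gpu_slots

    def fetch(self, conv_id):
        """Promote a conversation's KV cache to the GPU tier, if it exists."""
        if conv_id in self.gpu:
            self.gpu.move_to_end(conv_id)      # keep LRU order
            return self.gpu[conv_id]
        kv = self.host.pop(conv_id, None)      # pre-load from host if present
        if kv is not None:
            self._admit(conv_id, kv)
        return kv                              # None -> full prefill needed

    def save(self, conv_id, kv):
        """Asynchronous saving in the real system; synchronous here."""
        self._admit(conv_id, kv)

    def _admit(self, conv_id, kv):
        while len(self.gpu) >= self.gpu_slots: # evict LRU conversation to host
            victim, victim_kv = self.gpu.popitem(last=False)
            self.host[victim] = victim_kv
        self.gpu[conv_id] = kv

cache = HierarchicalKVCache()
cache.save("conv-1", kv={"k": "...", "v": "..."})
assert cache.fetch("conv-1") is not None       # turn 2 reuses turn 1's KV
```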

    • Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs

      • Sydney & Microsoft & Rutgers

      • TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization (e.g., 5-bit).
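
The core storage headache behind FP6 serving is that 6-bit values are not byte-aligned. A small numpy sketch of one possible packing layout (an illustrative assumption, not TC-FPx's actual bit layout):

```python
# Pack four 6-bit codes into three bytes and unpack them again.
import numpy as np

def pack_fp6(codes: np.ndarray) -> np.ndarray:
    """Pack 6-bit codes (values 0..63, length % 4 == 0) into bytes."""
    c = codes.astype(np.uint32).reshape(-1, 4)
    word = (c[:, 0] << 18) | (c[:, 1] << 12) | (c[:, 2] << 6) | c[:, 3]  # 24 bits
    out = np.empty((len(word), 3), dtype=np.uint8)
    out[:, 0] = (word >> 16) & 0xFF
    out[:, 1] = (word >> 8) & 0xFF
    out[:, 2] = word & 0xFF
    return out.ravel()

def unpack_fp6(packed: np.ndarray) -> np.ndarray:
    b = packed.astype(np.uint32).reshape(-1, 3)
    word = (b[:, 0] << 16) | (b[:, 1] << 8) | b[:, 2]
    return np.stack([(word >> s) & 0x3F for s in (18, 12, 6, 0)], 1).ravel().astype(np.uint8)

codes = np.random.randint(0, 64, size=16, dtype=np.uint8)
assert np.array_equal(unpack_fp6(pack_fp6(codes)), codes)
```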

  • LLM alignment / RLHF training

    • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch

      • THU

      • Intra-stage switching: explore model affinities and overlap computation via time-sharing.

      • Inter-stage switching: find the optimal switch plan with the minimum communication cost.

      • Based on Megatron-LM.
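
A toy sketch of the inter-stage switching objective, assuming a brute-force planner over made-up shard names and sizes (the paper's planner is far more sophisticated):

```python
# Pick the mapping of next-stage shards onto GPUs that minimizes the bytes
# that must move, given which weights are already resident on each GPU.
from itertools import permutations

def best_switch_plan(resident, shards, shard_bytes):
    """resident: gpu -> set of shard names already in its memory.
    Assign each next-stage shard to a distinct GPU, minimizing bytes moved."""
    gpus = list(resident)
    best, best_cost = None, float("inf")
    for perm in permutations(gpus, len(shards)):
        cost = sum(0 if s in resident[g] else shard_bytes[s]
                   for s, g in zip(shards, perm))
        if cost < best_cost:
            best, best_cost = dict(zip(shards, perm)), cost
    return best, best_cost

plan, cost = best_switch_plan(
    resident={0: {"actor"}, 1: {"critic"}, 2: set()},
    shards=["actor", "ref"], shard_bytes={"actor": 40, "ref": 10})
print(plan, cost)  # actor stays on GPU 0; ref pays 10 to load
```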

  • LLM federated fine-tuning

    • FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences

      • BUPT

      • Employ backpropagation (BP)-free training methods, requiring devices only to execute “perturbed inferences”; adaptively allocate computational loads across devices to balance between convergence speed and accuracy.
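
A numpy sketch of the BP-free flavor of training this implies: estimate gradients purely from forward passes along random perturbation directions (a generic zeroth-order estimator, not necessarily FwdLLM's exact formulation):

```python
import numpy as np

def perturbed_inference_grad(loss_fn, w, eps=1e-3, n_probes=8, rng=None):
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(w)
    for _ in range(n_probes):
        v = rng.standard_normal(w.shape)          # random direction
        # two "perturbed inferences" per probe; no backpropagation anywhere
        ddir = (loss_fn(w + eps * v) - loss_fn(w - eps * v)) / (2 * eps)
        g += ddir * v                             # directional derivative * v
    return g / n_probes

loss = lambda w: float(np.sum((w - 3.0) ** 2))    # toy quadratic, minimum at 3
w = np.zeros(4)
for _ in range(200):
    w -= 0.05 * perturbed_inference_grad(loss, w)
print(w.round(2))                                 # approaches [3. 3. 3. 3.]
```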

  • LLM training

    • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

      • Kuaishou

      • The balance between computation and memory utilization.

      • Two activation rematerialization strategies

        • Pipeline-parallel-aware offloading to maximize the utilization of host memory for storing activations.

        • Compute-memory balanced checkpointing to balance between activation memory and computational efficiency.
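
A stock-PyTorch sketch of the compute-memory balance: checkpoint only every k-th block, so some activations are recomputed in backward and the rest stay resident (the paper's strategies are additionally pipeline-parallelism-aware):

```python
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU())
     for _ in range(8)])

def forward(x, remat_every_k=2):
    for i, blk in enumerate(blocks):
        if i % remat_every_k == 0:
            # activations inside blk are dropped now, recomputed in backward
            x = checkpoint(blk, x, use_reentrant=False)
        else:
            x = blk(x)   # activations kept: cheaper backward, more memory
    return x

x = torch.randn(32, 512, requires_grad=True)
forward(x).sum().backward()
```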

Reliability

  • AI Infra

    • SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation

      • MSR & Microsoft

      • Best Paper Award

      • SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation (i.e., gray failure) caused by hardware redundancies and enhances overall reliability.

      • A comprehensive benchmark suite to evaluate individual hardware components and represent most real AI workloads.

  • HBM

    • Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field

      • Xiamen University & Huawei & Minjiang University

      • Conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services.

      • Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from multiple device levels to predict upcoming failures.

Supercomputer

  • Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?

    • THU & SDU & National Supercomputer Center in Wuxi

    • A comprehensive analysis of six years of data (40 TB, comprising I/O performance data and job running information) from Sunway TaihuLight, which has 41,508 nodes.

    • Note: the data is currently not available.

Distributed Training

  • Metis: Fast Automatic Distributed Training on Heterogeneous GPUs

    • Samsung Research & UNIST

    • Metis, a system that automatically finds efficient parallelism plans for distributed training on heterogeneous GPUs (see the sketch below).

    • Balance loads with heterogeneity-awareness; prefer data parallelism over tensor parallelism within a pipeline stage.

    • Evaluated with three large models (GPT-3, MoE, and Wide-ResNet).
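
A toy sketch of the plan space such a planner searches: enumerate (data, tensor, pipeline) degrees that factor the GPU count and score them with a placeholder cost model (the cost function below is made up):

```python
def factorizations(n):
    """Yield all (dp, tp, pp) parallel degrees with dp * tp * pp == n."""
    for dp in range(1, n + 1):
        if n % dp:
            continue
        for tp in range(1, n // dp + 1):
            if (n // dp) % tp == 0:
                yield dp, tp, n // dp // tp

def plan_cost(dp, tp, pp):
    # placeholder heuristic echoing the paper's preference for data
    # parallelism over tensor parallelism within a stage
    return 1.0 / dp + 2.0 * (tp - 1) + 0.5 * (pp - 1)

best = min(factorizations(8), key=lambda p: plan_cost(*p))
print(best)  # (8, 1, 1) under this toy cost
```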

Data Preprocessing

  • Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement

    • ETH & Google

    • Dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth.

    • Analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput (see the sketch below).
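
A plain-Python sketch of why transformation reordering pays off, assuming the filter's predicate does not depend on the map's output (Pecan checks such safety conditions automatically inside tf.data pipelines):

```python
import time

def expensive_map(x):
    time.sleep(0.001)          # stand-in for image decode / augmentation
    return x * 2

keep = lambda x: x % 10 == 0   # predicate independent of the map's output

data = range(1000)

t0 = time.perf_counter()
naive = [y for y in map(expensive_map, data) if keep(y // 2)]   # map everything
t1 = time.perf_counter()
reordered = [expensive_map(x) for x in data if keep(x)]         # map survivors only
t2 = time.perf_counter()

assert naive == reordered      # same elements, ~10x less map work
print(f"map-then-filter {t1-t0:.2f}s vs filter-then-map {t2-t1:.2f}s")
```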

Serverless Computing

  • Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu

    • SJTU (IPADS) & Huawei Cloud & EPFL

    • Jiagu, a serverless system based on OpenFaaS.

      • Pre-decision scheduling: decouple prediction from decision-making; predict each function's capacity on a server using a model (see the sketch below).

      • Dual-staged scaling: frequent adjustment of instances.
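
A minimal sketch of pre-decision scheduling, with a fake capacity model standing in for Jiagu's learned predictor: the expensive prediction runs off the critical path, and the per-request decision is a table lookup:

```python
class PreDecisionScheduler:
    def __init__(self, servers):
        self.capacity = {}      # (function, server) -> predicted capacity
        self.servers = servers

    def predict_offline(self, functions, model):
        """Slow path, off the critical path: predict every function's
        capacity on every server."""
        for f in functions:
            for s in self.servers:
                self.capacity[(f, s)] = model(f, s)

    def schedule(self, f, load):
        """Fast path: pure lookups, no model inference per request."""
        ok = [s for s in self.servers if self.capacity[(f, s)] >= load]
        return max(ok, key=lambda s: self.capacity[(f, s)], default=None)

sched = PreDecisionScheduler(servers=["s1", "s2"])
sched.predict_offline(["resize", "ocr"],
                      model=lambda f, s: 8 if s == "s1" else 4)  # fake model
print(sched.schedule("ocr", load=6))  # -> s1
```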

  • ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions

    • UVA & George Mason University & Adobe Research

    • Application-aware kernel scheduler.

    • Frontend: user space; approximate shortest remaining processing time (SRPT) priority scheduling by adaptively learning from an SRPT simulation on the recent past workload (see the sketch below).

    • Backend: use eBPF functions hooked into CFS to carry the frontend's scheduling decisions into the kernel.
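
A compact SRPT simulation of the kind the frontend learns from; here remaining times are given directly rather than predicted:

```python
import heapq

def srpt_schedule(jobs):
    """jobs: list of (arrival, service_time, name). Returns completion order."""
    jobs = sorted(jobs)                       # by arrival time
    t, i, ready, order = 0.0, 0, [], []
    while i < len(jobs) or ready:
        if not ready:
            t = max(t, jobs[i][0])            # idle until the next arrival
        while i < len(jobs) and jobs[i][0] <= t:
            arrival, svc, name = jobs[i]
            heapq.heappush(ready, (svc, name))  # key = remaining time
            i += 1
        remaining, name = heapq.heappop(ready)
        # run the shortest job either to completion or to the next arrival
        nxt = jobs[i][0] if i < len(jobs) else float("inf")
        run = min(remaining, nxt - t)
        t += run
        if run < remaining:
            heapq.heappush(ready, (remaining - run, name))  # preempted
        else:
            order.append((name, t))
    return order

print(srpt_schedule([(0, 10, "long"), (1, 2, "short")]))
# short preempts long: [('short', 3.0), ('long', 12.0)]
```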

  • StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow

    • HUST & INRIA

    • One GPU runtime per inference workflow instead of one GPU runtime per function.

    • Use CUDA streams for serverless inference; fine-grained GPU memory management; PCIe bandwidth sharing among concurrent streams (see the sketch below).
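
A short PyTorch sketch of the single-runtime, multi-stream direction (my illustration; StreamBox's own runtime does much more, e.g., memory management). Requires a CUDA GPU:

```python
import torch

assert torch.cuda.is_available()
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.cuda.stream(s1):      # "function 1" of the workflow
    out1 = a @ a
with torch.cuda.stream(s2):      # "function 2" may overlap with function 1
    out2 = b @ b
torch.cuda.synchronize()         # join both streams
print(out1.shape, out2.shape)
```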

  • A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND

    • Sungkyunkwan University & Yonsei University & Seoul National University

    • Enhance performance while maintaining strict data isolation between requests.

    • The container is reset to an initial state free of any sensitive data after each function request; incorporate a kernel-level memory snapshot management system; optimize runtime by reusing memory regions and leveraging the temporal locality of function executions.
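
A crude POSIX-only analogy for per-request state reset: handle each request in a forked child so all mutations, including secrets, vanish on exit (REWIND restores kernel-level memory snapshots instead and is far cheaper):

```python
import os

def handle_in_throwaway_child(handler, request):
    pid = os.fork()
    if pid == 0:                       # child: inherits a snapshot of the
        handler(request)               # parent's memory via copy-on-write
        os._exit(0)                    # all mutations (incl. secrets) vanish
    os.waitpid(pid, 0)                 # parent memory never saw the request

state = {"initialized": True}          # warm, request-independent state

def handler(req):
    state["secret"] = req              # pollution stays in the child

handle_in_throwaway_child(handler, "user-token-123")
assert "secret" not in state           # parent state is still pristine
```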

Model Serving

  • Power-aware Deep Learning Model Serving with μ-Serve

    • UIUC & IBM Research

    • Scale GPU frequency for power savings without SLO attainment violations (see the sketch below).
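
The underlying knob is GPU clock locking via NVML; a hedged sketch using the nvidia-ml-py package (the 60% cap is arbitrary, and deciding when to downclock without hurting SLOs is the paper's actual contribution). Usually needs root privileges and an NVIDIA GPU:

```python
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)
sm_max = pynvml.nvmlDeviceGetMaxClockInfo(h, pynvml.NVML_CLOCK_SM)

# cap the SM clock at ~60% of max during low-load periods (illustrative)
pynvml.nvmlDeviceSetGpuLockedClocks(h, 0, int(sm_max * 0.6))
# ... serve requests; raise the cap again if SLO slack shrinks ...
pynvml.nvmlDeviceResetGpuLockedClocks(h)
pynvml.nvmlShutdown()
```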

Cluster Scheduler

  • Starburst: A Cost-aware Scheduler for Hybrid Cloud

    • UC Berkeley & UCSB

    • Distinguished Artifact Award

    • Run batch workloads on a private cluster or the public cloud → a trade-off between cost and job completion time (JCT).

    • Dynamically control jobs' waiting times to improve utilization (see the sketch below).

      • Assign longer waits to large jobs to increase their chances of running on the cluster.

      • Assign shorter waits to small jobs to increase their chances of running on the cloud.
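
A toy sketch of size-proportional waiting, with made-up constants:

```python
def assign_wait(job_gpus, job_runtime_h, base_wait_h=0.25, k=0.05):
    job_size = job_gpus * job_runtime_h        # GPU-hours
    return base_wait_h + k * job_size          # longer wait for larger jobs

def place(job, cluster_has_room, waited_h):
    if cluster_has_room:
        return "cluster"
    if waited_h >= assign_wait(*job):
        return "cloud"                         # stop waiting, pay for cloud
    return "keep-waiting"

print(assign_wait(64, 10))   # big job: waits ~32.25 h for cheap capacity
print(assign_wait(1, 0.5))   # small job: spills to cloud after ~0.28 h
```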

Deep Learning Compiler

  • MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States

    • THU

    • Generate more complete operator graphs by collecting key runtime information through monitoring program execution (see the sketch below).

    • Provide a reference graph to record program execution states and leverage reference relationships to identify state changes that can impact program outputs.
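
A toy illustration of graph capture by monitoring eager execution: a proxy value records every operator it sees, so the captured graph reflects what actually ran (MAGPY's reference graphs and state tracking go well beyond this):

```python
class Traced:
    _graph = []                    # recorded (op, inputs, output) triples

    def __init__(self, name):
        self.name = name

    def _record(self, op, other):
        out = Traced(f"t{len(Traced._graph)}")
        Traced._graph.append(
            (op, (self.name, getattr(other, "name", other)), out.name))
        return out

    def __add__(self, other):
        return self._record("add", other)

    def __mul__(self, other):
        return self._record("mul", other)

x, w = Traced("x"), Traced("w")
y = x * w + 1.0                    # eager program under monitoring
for op, inputs, output in Traced._graph:
    print(output, "=", op, inputs)
# t0 = mul ('x', 'w')
# t1 = add ('t0', 1.0)
```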

Deep Learning Recommendation Models (DLRMs)

  • OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model

    • UCSD & UCSB & Meta & Pacific Northwest National Laboratory

    • Provide a near-optimal parallelization strategy for embedding tables (see the sketch below).
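
The underlying placement problem, sketched with plain longest-processing-time greedy placement (OPER's optimality-guided strategy is more refined; the costs are made up):

```python
import heapq

def place_tables(table_costs, n_gpus):
    """table_costs: name -> estimated lookup cost. Returns gpu -> [tables]."""
    heap = [(0.0, g) for g in range(n_gpus)]       # (load, gpu)
    heapq.heapify(heap)
    placement = {g: [] for g in range(n_gpus)}
    for name, cost in sorted(table_costs.items(), key=lambda kv: -kv[1]):
        load, g = heapq.heappop(heap)              # least-loaded GPU
        placement[g].append(name)
        heapq.heappush(heap, (load + cost, g))
    return placement

print(place_tables({"user": 9.0, "item": 7.0, "ad": 3.0, "geo": 2.0}, 2))
# {0: ['user', 'geo'], 1: ['item', 'ad']}  (loads 11 vs 10)
```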

Probabilistic Graphical Models

  • Fast Inference for Probabilistic Graphical Models

    • University of Western Australia & HKUST

    • Fast-PGM: a fast and parallel PGM inference system for importance-sampling-based approximate inference algorithms (see the sketch below).
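
A self-contained sketch of the algorithm family being accelerated: likelihood weighting (one importance-sampling-based inference method) on a two-node Rain → WetGrass network with made-up probabilities:

```python
import random

P_RAIN = 0.2
P_WET_GIVEN = {True: 0.9, False: 0.1}       # P(wet | rain)

def estimate_p_rain_given_wet(n=100_000, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n):
        rain = rng.random() < P_RAIN        # sample non-evidence variables
        w = P_WET_GIVEN[rain]               # weight by evidence likelihood
        num += w * rain
        den += w
    return num / den

print(round(estimate_p_rain_given_wet(), 3))
# exact: 0.9*0.2 / (0.9*0.2 + 0.1*0.8) ≈ 0.692
```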

Remote Direct Memory Access (RDMA)

  • PeRF: Preemption-enabled RDMA Framework

    • Acryl Inc. & Sungkyunkwan University

    • Offer software-based performance isolation for efficient multi-tenancy in RDMA.

Remote Procedure Call (RPC)

  • HydraRPC: RPC in the CXL Era

    • Alibaba & THU & ZJU & PKU

    • Utilize CXL-attached HDM (host-managed device memory) to build RPC systems (see the sketch below).
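
A loose single-host analogy using Python shared memory: the "RPC" is a payload written into a shared region plus a flag flip, with no network send/receive on the data path (HydraRPC does this across hosts over CXL HDM):

```python
from multiprocessing import Process, shared_memory

def server(name):
    shm = shared_memory.SharedMemory(name=name)
    while shm.buf[0] != 1:                      # poll the "request ready" flag
        pass
    shm.buf[1:6] = bytes(shm.buf[1:6]).upper()  # write the response in place
    shm.buf[0] = 2                              # flip the "response ready" flag
    shm.close()

if __name__ == "__main__":
    shm = shared_memory.SharedMemory(create=True, size=64)
    p = Process(target=server, args=(shm.name,))
    p.start()
    shm.buf[1:6] = b"hello"                     # marshal the request
    shm.buf[0] = 1                              # no send(): just a flag flip
    while shm.buf[0] != 2:
        pass
    print(bytes(shm.buf[1:6]))                  # b'HELLO'
    p.join(); shm.close(); shm.unlink()
```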

Journaling File System

  • FastCommit: Resource-Efficient, Performant and Cost-Effective File System Journaling

    • Google

    • Best Paper Award

Rust-for-Linux

  • An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise

    • BUPT & UESTC
