# ATC 2024

## Meta Info

Homepage: <https://www.usenix.org/conference/atc24>

Paper list: <https://www.usenix.org/conference/atc24/technical-sessions>

### Acceptance Rate

15.8% (= 77 / 488)

## Papers

### Large Language Models (LLMs)

* Serving LLMs
  * Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention \[[Paper](https://www.usenix.org/conference/atc24/presentation/gao-bin-cost)]
    * NUS & SJTU & Huawei Cloud
    * Reuse KV caches across multi-turn conversations; maintain a hierarchical KV caching system; layer-wise pre-loading and asynchronous saving; scheduler-aware fetching and eviction.
  * Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs \[[Paper](https://www.usenix.org/conference/atc24/presentation/xia)] \[[Code](https://github.com/usyd-fsalab/fp6_llm)]
    * Sydney & Microsoft & Rutgers
    * **TC-FPx**, the first full-stack *GPU kernel design* scheme with unified Tensor Core support of 6-bit and arbitrary bit-width quantization (e.g., 5-bit).
* LLM alignment / RLHF training
  * PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch \[[Paper](https://www.usenix.org/conference/atc24/presentation/lei)]
    * THU
    * Intra-stage switching: explore model affinities and overlap computation via time-sharing.
    * Inter-stage switching: find the optimal switch plan with the minimum communication cost.
    * Based on Megatron-LM.
* LLM federated fine-tuning
  * FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences \[[Paper](https://www.usenix.org/conference/atc24/presentation/xu-mengwei)] \[[Code](https://github.com/UbiquitousLearning/FwdLLM)]
    * BUPT
    * Employ backpropagation (BP)-free training methods, requiring devices only to execute “perturbed inferences”; adaptively allocate computational loads across devices to balance between convergence speed and accuracy.
* LLM training
  * Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism \[[Paper](https://www.usenix.org/conference/atc24/presentation/yuan)] \[[Code](https://github.com/kwai/Megatron-Kwai/tree/atc24ae/examples/atc24)]
    * Kuaishou
    * Balance computation against memory utilization.
    * Two activation rematerialization strategies
      * *Pipeline-parallel-aware offloading* to maximize the utilization of host memory for storing activations.
      * *Compute-memory balanced checkpointing* to balance between activation memory and computational efficiency.
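
The KV-cache reuse behind CachedAttention can be sketched as an LRU store keyed by conversation ID, so a returning conversation skips re-prefilling its history. Everything here (`KVCacheStore`, the string placeholders standing in for per-layer KV tensors) is an illustrative toy, not the paper's system:

```python
from collections import OrderedDict

class KVCacheStore:
    """LRU store mapping conversation id -> per-layer KV tensors."""
    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def fetch(self, conv_id):
        # Hit: reuse the saved KV cache instead of re-prefilling history.
        if conv_id in self._store:
            self._store.move_to_end(conv_id)
            return self._store[conv_id]
        return None

    def save(self, conv_id, kv_layers):
        self._store[conv_id] = kv_layers
        self._store.move_to_end(conv_id)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

store = KVCacheStore(capacity=2)
store.save("conv-a", ["layer0-kv", "layer1-kv"])
store.save("conv-b", ["layer0-kv", "layer1-kv"])
assert store.fetch("conv-a") is not None   # reuse across turns
store.save("conv-c", ["layer0-kv"])        # evicts conv-b (least recent)
assert store.fetch("conv-b") is None
```

The real system layers this store hierarchically (HBM, host memory, disk) and hides transfer latency with layer-wise pre-loading and asynchronous saving.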
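
On the storage side, Quant-LLM-style arbitrary-bit-width support boils down to packing sub-byte weights densely. A plain-Python bit packer shows the idea; this is a sketch only, while the actual TC-FPx contribution is GPU kernels with Tensor Core de-quantization:

```python
def pack_bits(values, bits=6):
    """Pack unsigned `bits`-wide integers into bytes, little-endian bit order."""
    buf, acc, nbits = bytearray(), 0, 0
    for v in values:
        assert 0 <= v < (1 << bits)
        acc |= v << nbits
        nbits += bits
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)   # flush the partial last byte
    return bytes(buf)

def unpack_bits(data, count, bits=6):
    """Inverse of pack_bits: recover `count` values from the byte stream."""
    out, acc, nbits, i = [], 0, 0, 0
    for _ in range(count):
        while nbits < bits:
            acc |= data[i] << nbits
            i += 1
            nbits += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nbits -= bits
    return out

weights = [0, 63, 17, 42]                      # toy 6-bit quantized weights
assert unpack_bits(pack_bits(weights), len(weights)) == weights
```

At 6 bits per weight, eight weights fit exactly in six bytes, which is the memory saving FP6 serving exploits.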
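
FwdLLM's "perturbed inferences" belong to the family of BP-free, zeroth-order gradient estimates: two forward passes along a random perturbation approximate the gradient. A minimal SPSA-style sketch on a toy scalar loss, where the learning rate, epsilon, and step count are arbitrary choices rather than the paper's:

```python
import random

def loss(w):
    # Toy quadratic loss with minimum at w = 3.
    return (w - 3.0) ** 2

def perturbed_grad(w, eps=1e-3):
    """Estimate dL/dw from two forward passes along a random direction."""
    u = random.choice([-1.0, 1.0])            # random perturbation direction
    return u * (loss(w + eps * u) - loss(w - eps * u)) / (2 * eps)

random.seed(0)
w = 0.0
for _ in range(200):
    w -= 0.1 * perturbed_grad(w)              # SGD with the BP-free estimate
assert abs(w - 3.0) < 1e-2
```

Because devices only run inference, no backward graph or optimizer state is needed on-device, which is what makes federated fine-tuning feasible there.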
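
Compute-memory balanced checkpointing amounts to deciding which activations to drop (and recompute in the backward pass) so that activation memory fits a budget at minimal recompute cost. A greedy toy planner under the assumption that per-layer memory and recompute costs are known; the paper's actual strategy is more involved:

```python
def plan_checkpoints(act_mem, recompute_cost, budget):
    """Return (layers to rematerialize, resulting activation memory).

    Greedy: drop layers with the best memory-freed-per-recompute ratio first.
    """
    total = sum(act_mem)
    order = sorted(range(len(act_mem)),
                   key=lambda i: act_mem[i] / recompute_cost[i],
                   reverse=True)
    drop = []
    for i in order:
        if total <= budget:
            break
        drop.append(i)
        total -= act_mem[i]
    return sorted(drop), total

drop, mem = plan_checkpoints(act_mem=[4, 1, 3, 2],
                             recompute_cost=[2, 1, 3, 1],
                             budget=5)
assert mem <= 5
assert drop == [0, 3]   # the two cheapest-to-recompute big layers
```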

### Reliability

* AI Infra
  * SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation \[[Paper](https://www.usenix.org/conference/atc24/presentation/xiong)] \[[Code](https://github.com/microsoft/superbenchmark)]
    * MSR & Microsoft
    * **Best Paper Award**
    * SuperBench, a proactive validation system for AI infrastructure that mitigates *hidden degradation* (i.e., *gray failure*) caused by hardware redundancies and enhances overall reliability.
    * A comprehensive benchmark suite to evaluate individual hardware components and represent most real AI workloads.
* HBM
  * Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field \[[Paper](https://www.usenix.org/conference/atc24/presentation/wu-ronglong)] \[[Code](https://github.com/wrl297/Calchas)]
    * Xiamen University & Huawei & Minjiang University
    * Conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services.
    * Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures.

### Supercomputer

* Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?
  * THU & SDU & National Supercomputer Center in Wuxi
  * A comprehensive analysis of six years of data (40 TB, comprising I/O performance data and job-running information) from Sunway TaihuLight, a leadership supercomputer with 41,508 nodes.
  * **Notice**: The data is currently not available.

### Distributed Training

* Metis: Fast Automatic Distributed Training on Heterogeneous GPUs \[[Paper](https://www.usenix.org/conference/atc24/presentation/um)]
  * Samsung Research & UNIST
  * Metis, a system that automatically finds efficient parallelism plans for distributed training on *heterogeneous GPUs*.
  * Balance loads with heterogeneity-awareness; prefer data parallelism over tensor parallelism within a pipeline stage.
  * Evaluated with three large models (GPT-3, MoE, and Wide-ResNet).
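
One heterogeneity-aware idea in Metis's setting can be illustrated simply: split the global batch across data-parallel replicas in proportion to each GPU's measured throughput, so fast and slow GPUs finish a step together. Throughput numbers and function names here are invented for the example:

```python
def split_batch(global_batch, throughputs):
    """Assign per-GPU batch sizes proportional to measured throughput."""
    total = sum(throughputs)
    sizes = [global_batch * t // total for t in throughputs]
    # Hand out the rounding remainder to the fastest GPUs first.
    for i in sorted(range(len(sizes)), key=lambda i: throughputs[i],
                    reverse=True):
        if sum(sizes) == global_batch:
            break
        sizes[i] += 1
    return sizes

# e.g. two fast GPUs plus one slower one
sizes = split_batch(global_batch=128, throughputs=[300, 300, 100])
assert sum(sizes) == 128
assert sizes[0] > sizes[2]   # the faster GPU gets a larger share
```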

### Data Preprocessing

* Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement \[[Paper](https://www.usenix.org/conference/atc24/presentation/graur)] \[[Code](https://github.com/eth-easl/pecan-experiments)]
  * ETH & Google
  * Dynamically *schedule data preprocessing workers on ML accelerator host resources* to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth.
  * Analyze the characteristics of input pipelines and *automatically reorder transformations* to increase data preprocessing worker throughput.
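
Pecan's transformation reordering exploits the fact that, when ops commute, running record-reducing ops (e.g. filters) before expensive per-record ops raises preprocessing throughput. A toy cost model with invented `(cost, selectivity)` pairs, using the classic rank-by-`(selectivity - 1) / cost` ordering:

```python
def pipeline_cost(ops):
    """Expected cost per input record; each op is (cost, selectivity)."""
    cost, records = 0.0, 1.0
    for c, sel in ops:
        cost += records * c      # pay op cost for surviving records
        records *= sel           # selectivity shrinks the stream
    return cost

def reorder(ops):
    """Order commutable ops by (selectivity - 1) / cost, ascending."""
    return sorted(ops, key=lambda op: (op[1] - 1.0) / op[0])

ops = [(10.0, 1.0),   # expensive decode, keeps all records
       (1.0, 0.2)]    # cheap filter, keeps 20% of records
assert pipeline_cost(reorder(ops)) < pipeline_cost(ops)
```

Running the filter first cuts the per-record cost from 11 to 3 in this toy pipeline, which is the effect the reordering targets.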

### Serverless Computing

* Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu \[[Paper](https://www.usenix.org/conference/atc24/presentation/liu-qingyuan)]
  * SJTU IPADS & Huawei Cloud & EPFL
  * **Jiagu**, a serverless system based on OpenFaaS
    * *Pre-decision scheduling:* decouple prediction and decision-making; predict every function's capacities on a server using a model.
    * *Dual-staged scaling:* frequent adjustment of instances.
* ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions \[[Paper](https://www.usenix.org/conference/atc24/presentation/fu)] \[[Code](https://github.com/ds2-lab/ALPS)]
  * UVA & George Mason University & Adobe Research
  * **ALPS**: **A**daptive **L**earning, **P**riority **S**cheduler
    * Application-aware kernel scheduler
    * Frontend: user-space; approximate *shortest remaining process time* (SRPT) priority scheduling by adaptively learning from an SRPT simulation on recent past workload.
    * Backend: use eBPF functions hooked to CFS to inform scheduling decisions (from the frontend) in the kernel.
* StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow \[[Paper](https://www.usenix.org/conference/atc24/presentation/wu-hao)] \[[Code](https://github.com/CGCL-codes/streambox)]
  * HUST & INRIA
  * *One GPU runtime per inference workflow* instead of *one GPU runtime per function*.
  * Use CUDA streams for serverless inference; fine-grained GPU memory management; and PCIe bandwidth sharing among concurrent streams.
* A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND \[[Paper](https://www.usenix.org/conference/atc24/presentation/song)] \[[Code](https://github.com/s3yonsei/rewind_serverless)]
  * Sungkyunkwan University & Yonsei University & Seoul National University
  * Enhance performance while *maintaining strict data isolation between requests*.
  * The container is reset to an initial state free of any sensitive data after each function request; incorporate a kernel-level memory snapshot management system; optimize runtime by reusing memory regions and leveraging the temporal locality of function executions.
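
ALPS's frontend approximates SRPT by predicting each function's remaining runtime from recent history and prioritizing the smallest estimated remainder. A minimal sketch; the durations, `history`, and `pick_next` are illustrative, not from the system:

```python
from statistics import mean

history = {"thumbnail": [0.05, 0.06, 0.05],    # past durations in seconds
           "video-encode": [9.8, 10.2, 10.0]}

def remaining(fn, elapsed):
    """Estimated remaining runtime from the historical mean duration."""
    return max(mean(history[fn]) - elapsed, 0.0)

def pick_next(running):
    """running: list of (fn_name, elapsed_s). Return the SRPT choice."""
    return min(running, key=lambda r: remaining(*r))

choice = pick_next([("video-encode", 1.0), ("thumbnail", 0.0)])
assert choice[0] == "thumbnail"   # smallest estimated remainder runs first
```

In the real system this user-space estimate is fed to eBPF hooks on CFS so the kernel enforces the priorities.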
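
REWIND's per-request reset can be pictured at the language level as snapshotting the runtime's clean post-initialization state once, then rewinding to it after every request so no sensitive data survives into the next one. The real system does this with kernel-level memory snapshots rather than `deepcopy`; this is only an analogy:

```python
import copy

class Sandbox:
    def __init__(self):
        self.state = {"libs_loaded": True, "scratch": {}}
        self._snapshot = copy.deepcopy(self.state)  # clean post-init image

    def handle(self, request):
        self.state["scratch"]["secret"] = request   # request-private data
        result = len(request)
        self.rewind()                               # reset before reuse
        return result

    def rewind(self):
        self.state = copy.deepcopy(self._snapshot)

box = Sandbox()
assert box.handle("user-token-123") == 14
assert "secret" not in box.state["scratch"]   # nothing leaks across requests
```

Reusing the warm snapshot gives container-reuse performance while preserving the data isolation of a cold start.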

### Model Serving

* Power-aware Deep Learning Model Serving with μ-Serve \[[Paper](https://www.usenix.org/conference/atc24/presentation/qiu)]
  * UIUC & IBM Research
  * Scale GPU frequency to save power without violating SLO attainment.
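
The power-saving knob in μ-Serve's setting can be sketched as: among the available GPU frequencies, pick the lowest one whose predicted latency still meets the SLO. The latency model and numbers below are invented for illustration:

```python
def pick_frequency(freqs_mhz, latency_at, slo_ms):
    """Lowest frequency whose predicted latency meets the SLO."""
    for f in sorted(freqs_mhz):
        if latency_at(f) <= slo_ms:
            return f
    return max(freqs_mhz)   # fall back to full speed if nothing fits

# Toy model: latency scales inversely with clock frequency.
latency = lambda f: 1000.0 * 1200 / f   # predicted ms at f MHz

f = pick_frequency([900, 1200, 1500], latency, slo_ms=1100.0)
assert f == 1200   # 900 MHz would violate the SLO; 1500 MHz wastes power
```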

### Cluster Scheduler

* Starburst: A Cost-aware Scheduler for Hybrid Cloud \[[Paper](https://www.usenix.org/conference/atc24/presentation/luo)] \[[Code](https://github.com/michaelzhiluo/starburst)]
  * UC Berkeley & UCSB
  * **Distinguished Artifact Award**
  * Run batch workloads on private clusters or the public cloud, trading off cost against job completion time (JCT).
  * Dynamically control jobs' waiting times to improve utilization.
    * Assign longer waits for large jobs to increase their chances of running on the cluster.
    * Assign shorter waits to small jobs to increase their chances of running on the cloud.
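
A toy version of Starburst's waiting policy: the wait budget scales with job size, so big jobs wait longer for cheap on-prem capacity while small jobs spill quickly to the cloud. The scaling constant and function names are assumptions for the sketch:

```python
def wait_budget(gpus_requested, base_s=60.0):
    """Seconds a job waits for the private cluster before cloud spill."""
    return base_s * gpus_requested

def place(job_gpus, waited_s):
    """Keep waiting for the cluster, or give up and run on the cloud."""
    return "cluster-queue" if waited_s < wait_budget(job_gpus) else "cloud"

assert place(job_gpus=64, waited_s=600) == "cluster-queue"  # big job keeps waiting
assert place(job_gpus=1, waited_s=600) == "cloud"           # small job spills fast
```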

### Deep Learning Compiler

* MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States \[[Paper](https://www.usenix.org/conference/atc24/presentation/zhang-chen)] \[[Code](https://github.com/heheda12345/MagPy)]
  * THU
  * Generate more complete operator graphs by collecting key runtime information through monitoring program execution.
  * Provide a reference graph to record program execution states and leverage reference relationships to identify state changes that can impact program outputs.

### Deep Learning Recommendation Models (DLRMs)

* OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model \[[Paper](https://www.usenix.org/conference/atc24/presentation/wang)]
  * UCSD & UCSB & Meta & Pacific Northwest National Laboratory
  * Provide a near-optimal parallelization strategy for embedding tables.
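
As a baseline for the problem OPER optimizes, embedding-table placement can be done greedily: always put the next-heaviest table on the least-loaded device (classic LPT scheduling). This is shown for contrast with OPER's optimality-guided approach; the lookup loads are invented:

```python
import heapq

def place_tables(table_loads, num_devices):
    """Greedy longest-processing-time placement; returns device -> tables."""
    heap = [(0.0, d) for d in range(num_devices)]    # (accumulated load, device)
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    for t in sorted(range(len(table_loads)),
                    key=lambda t: table_loads[t], reverse=True):
        load, d = heapq.heappop(heap)                # least-loaded device
        placement[d].append(t)
        heapq.heappush(heap, (load + table_loads[t], d))
    return placement

table_loads = [8.0, 7.0, 3.0, 2.0]
placement = place_tables(table_loads, num_devices=2)
loads = [sum(table_loads[t] for t in ts) for ts in placement.values()]
assert abs(loads[0] - loads[1]) <= 2.0   # devices end up roughly balanced
```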

### Probabilistic Graphical Models

* Fast Inference for Probabilistic Graphical Models \[[Paper](https://www.usenix.org/conference/atc24/presentation/jiang)] \[[Code](https://github.com/jjiantong/FastPGM)]
  * University of Western Australia & HKUST
  * **Fast-PGM**: a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms.
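
A minimal importance-sampling sketch of the inference style Fast-PGM parallelizes: estimate an expectation under a target distribution `p` by sampling from a proposal `q` and weighting each sample by `p(x)/q(x)`. The distributions here are toy coins, not PGM code:

```python
import random

def importance_estimate(f, p, q, sample_q, n=100_000):
    """Monte Carlo estimate of E_p[f(X)] using proposal q."""
    acc = 0.0
    for _ in range(n):
        x = sample_q()
        acc += f(x) * p(x) / q(x)   # importance weight p/q corrects the bias
    return acc / n

random.seed(42)
p = lambda x: 0.3 if x == 1 else 0.7   # target: biased coin, P(1) = 0.3
q = lambda x: 0.5                      # proposal: fair coin
sample = lambda: random.randint(0, 1)
est = importance_estimate(lambda x: x, p, q, sample)  # estimates P_p(X = 1)
assert abs(est - 0.3) < 0.02
```

The inner loop is embarrassingly parallel across samples, which is the structure Fast-PGM exploits.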

### Remote Direct Memory Access (RDMA)

* PeRF: Preemption-enabled RDMA Framework \[[Paper](https://www.usenix.org/conference/atc24/presentation/lee)]
  * Acryl Inc. & Sungkyunkwan University
  * Offer *software-based performance isolation* for efficient *multi-tenancy* in RDMA.

### Remote Procedure Call (RPC)

* HydraRPC: RPC in the CXL Era \[[Paper](https://www.usenix.org/conference/atc24/presentation/ma)]
  * Alibaba & THU & ZJU & PKU
  * Utilize CXL-attached HDM to build RPC systems.

### Journaling File System

* FastCommit: Resource-Efficient, Performant and Cost-Effective File System Journaling
  * Google
  * **Best Paper Award**

### Rust-for-Linux

* An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise \[[Paper](https://www.usenix.org/conference/atc24/presentation/li-hongyu)] \[[Code](https://github.com/Richardhongyu/rfl_empirical_tools)]
  * BUPT & UESTC
