ATC 2024
Homepage:
Paper list:
Acceptance rate: 15.8% (= 77 / 488)
Serving LLMs
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention []
NUS & SJTU & Huawei Cloud
Reuse KV caches across multi-turn conversations; maintain a hierarchical KV caching system; layer-wise pre-loading and asynchronous saving; scheduler-aware fetching and eviction.
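To make the caching idea concrete, here is a minimal Python sketch of a two-tier KV cache that evicts from a small "GPU" tier to host memory and promotes entries back on reuse. All class and method names are hypothetical; CachedAttention's layer-wise preloading and scheduler-aware fetching are only approximated by synchronous calls.

```python
"""Minimal sketch of a hierarchical KV cache (names and policy are mine,
not CachedAttention's actual implementation)."""
from collections import OrderedDict
import numpy as np

class HierarchicalKVCache:
    def __init__(self, gpu_capacity=2):
        self.gpu = OrderedDict()   # small, fast tier (LRU order)
        self.host = {}             # large, slower host-memory tier
        self.gpu_capacity = gpu_capacity

    def put(self, conv_id, kv_layers):
        """Save a conversation's per-layer KV tensors (asynchronous in the
        real system; synchronous in this sketch)."""
        self.gpu[conv_id] = kv_layers
        self.gpu.move_to_end(conv_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, kv = self.gpu.popitem(last=False)  # evict LRU to host
            self.host[victim] = kv

    def get(self, conv_id):
        """Fetch KV on a new turn; the real system overlaps this with
        compute by preloading layer i+1 while layer i is consumed."""
        if conv_id in self.gpu:
            self.gpu.move_to_end(conv_id)
            return self.gpu[conv_id]
        if conv_id in self.host:
            kv = self.host.pop(conv_id)   # promote back to the GPU tier
            self.put(conv_id, kv)
            return kv
        return None  # miss: prefill must recompute the KV cache

cache = HierarchicalKVCache()
cache.put("conv-1", [np.zeros((2, 8, 64)) for _ in range(4)])  # 4 layers
cache.put("conv-2", [np.zeros((2, 8, 64)) for _ in range(4)])
cache.put("conv-3", [np.zeros((2, 8, 64)) for _ in range(4)])  # evicts conv-1
assert cache.get("conv-1") is not None  # promoted back from the host tier
```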
Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs [] []
Sydney & Microsoft & Rutgers
TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization (e.g., 5-bit).
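A hedged numpy sketch of the underlying packing arithmetic: 6-bit values are not byte-aligned, so they must be packed into a dense bit stream and unpacked at use. TC-FPx does this inside Tensor Core kernels; the functions below (names mine) only demonstrate the bit manipulation on the CPU.

```python
"""Illustrative 6-bit packing/unpacking; not Quant-LLM's kernel code."""
import numpy as np

def pack6(vals):
    """Pack an array of 6-bit integers (0..63) into a dense byte buffer."""
    bits = np.unpackbits(vals.astype(np.uint8)[:, None], axis=1)[:, 2:]  # low 6 bits
    flat = bits.reshape(-1)
    pad = (-len(flat)) % 8                      # pad to a byte boundary
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    return np.packbits(flat)

def unpack6(buf, n):
    """Recover n 6-bit integers from the packed buffer."""
    bits = np.unpackbits(buf)[: n * 6].reshape(n, 6)
    full = np.concatenate([np.zeros((n, 2), dtype=np.uint8), bits], axis=1)
    return np.packbits(full, axis=1).reshape(n)

w = np.random.randint(0, 64, size=16)
assert np.array_equal(unpack6(pack6(w), 16), w)
print(f"16 six-bit weights stored in {pack6(w).nbytes} bytes (vs. 16 bytes)")
```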
LLM alignment / RLHF training
PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch []
THU
Intra-stage switching: explore model affinities and overlap computation via time-sharing.
Inter-stage switching: find the optimal switch plan with the minimum communication cost.
Based on Megatron-LM.
LLM federated fine-tuning
FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences [] []
BUPT
Employ backpropagation (BP)-free training methods that require devices only to execute "perturbed inferences"; adaptively allocate computational loads across devices to balance convergence speed and accuracy.
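The core trick can be demonstrated with a forward-gradient step on a toy objective: the "gradient" is a directional derivative estimated from two perturbed forward passes, so no backpropagation is needed. The quadratic loss and hyperparameters below are invented for illustration, not FwdLLM's actual setup.

```python
"""BP-free training via perturbed inferences (forward gradients)."""
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    """Toy forward-only objective standing in for a fine-tuning loss."""
    return float(np.sum((w - 1.0) ** 2))

w, lr, eps = np.zeros(8), 0.02, 1e-3
for _ in range(500):
    u = rng.standard_normal(w.shape)            # random perturbation direction
    # two "perturbed inferences" yield a directional derivative: no backprop
    d = (loss(w + eps * u) - loss(w - eps * u)) / (2 * eps)
    w -= lr * d * u                             # forward-gradient update
print(round(loss(w), 6))  # decreases toward 0 as w approaches the optimum
```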
LLM training
Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [] []
Kuaishou
Balance computation against memory utilization.
Two activation rematerialization strategies:
Pipeline-parallel-aware offloading to maximize the utilization of host memory for storing activations.
Compute-memory balanced checkpointing to balance between activation memory and computational efficiency.
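The toy planner below illustrates the compute-memory trade-off: each layer's activation is kept on device, offloaded to host memory, or marked for recomputation under explicit budgets. The costs and the greedy policy are fabricated for illustration; the paper's strategies are pipeline-parallel-aware and considerably more sophisticated.

```python
"""Toy keep/offload/recompute planner (invented costs and policy)."""

layers = [  # (name, activation_mb, recompute_ms, offload_ms)
    ("embed", 512, 1.0, 4.0),
    ("attn0", 2048, 6.0, 16.0),
    ("mlp0", 1024, 2.0, 8.0),
    ("attn1", 2048, 6.0, 16.0),
]

def plan(device_budget_mb, host_budget_mb):
    decisions, dev_used, host_used = {}, 0, 0
    # keep the most expensive-to-rematerialize activations on device first
    for name, mb, recomp, offload in sorted(layers, key=lambda l: -l[2]):
        if dev_used + mb <= device_budget_mb:
            decisions[name], dev_used = "keep", dev_used + mb
        elif host_used + mb <= host_budget_mb:
            decisions[name], host_used = "offload", host_used + mb
        else:
            decisions[name] = "recompute"
    return decisions

# one layer kept, one offloaded, one recomputed under these budgets
print(plan(device_budget_mb=3000, host_budget_mb=2048))
```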
AI Infra
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation [] []
MSR & Microsoft
Best Paper Award
SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation (i.e., gray failure) caused by hardware redundancies and enhances overall reliability.
A comprehensive benchmark suite to evaluate individual hardware components and represent most real AI workloads.
HBM
Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field [] []
Xiamen University & Huawei & Minjiang University
Conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services.
Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures.
Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?
THU & SDU & National Supercomputer Center in Wuxi
A comprehensive analysis of six years' worth of data (40 TB, comprising I/O performance data and job-running information) from Sunway TaihuLight, a system with 41,508 nodes.
Notice: The data is currently not available.
Metis: Fast Automatic Distributed Training on Heterogeneous GPUs []
Samsung Research & UNIST
Metis, a system that automatically finds efficient parallelism plans for distributed training on heterogeneous GPUs.
Balance loads with heterogeneity-awareness; prefer data parallelism over tensor parallelism within a pipeline stage.
Evaluated with three large models (GPT-3, MoE, and Wide-Resnet).
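The flavor of heterogeneity-aware planning can be shown with a brute-force search for pipeline-stage cuts that minimize the bottleneck stage time across GPUs of different speeds. Layer costs and GPU speeds below are made up, and real Metis searches a much larger space (data, tensor, and pipeline parallelism jointly).

```python
"""Toy heterogeneity-aware pipeline-stage split (invented numbers)."""
from itertools import combinations

layer_cost = [4, 4, 4, 4, 2, 2, 2, 2]   # per-layer compute cost
gpu_speed = [2.0, 1.0, 1.0]             # one fast GPU, two slower ones

def best_split():
    n, best = len(layer_cost), (float("inf"), None)
    for cuts in combinations(range(1, n), len(gpu_speed) - 1):
        bounds = [0, *cuts, n]
        stages = [layer_cost[bounds[i]:bounds[i + 1]]
                  for i in range(len(gpu_speed))]
        # the slowest stage bounds pipeline throughput
        t = max(sum(s) / v for s, v in zip(stages, gpu_speed))
        if t < best[0]:
            best = (t, bounds)
    return best

t, bounds = best_split()
print(f"bottleneck stage time {t:.2f} with layer cuts at {bounds[1:-1]}")
```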
Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement [] []
ETH & Google
Dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth.
Analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput.
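A tiny cost model shows why reordering pays off: if a cheap, selective transformation commutes with an expensive one, running it first shrinks the downstream work. The pipeline and numbers below are invented; Pecan's reordering analysis is far richer.

```python
"""Why transformation reordering helps (fabricated pipeline)."""

ops = [  # (name, cost_per_item_us, selectivity: fraction of items kept)
    ("decode+augment", 50.0, 1.0),
    ("filter_bad_labels", 1.0, 0.7),
]

def pipeline_cost(order, n_items=1_000):
    total, remaining = 0.0, n_items
    for name, cost, keep in order:
        total += remaining * cost   # pay per item that reaches this op
        remaining *= keep           # selective ops shrink the stream
    return total

naive = pipeline_cost(ops)
reordered = pipeline_cost(sorted(ops, key=lambda o: o[2]))  # selective first
print(f"naive: {naive:.0f} us, reordered: {reordered:.0f} us")
```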
Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu []
SJTU IPADS & Huawei Cloud & EPFL
Jiagu, a serverless system based on OpenFaaS.
Pre-decision scheduling: decouple prediction and decision-making; predict every function's capacities on a server using a model.
Dual-staged scaling: frequent adjustment of instances.
ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions [] []
UVA & George Mason University & Adobe Research
ALPS: Adaptive Learning, Priority Scheduler
Application-aware kernel scheduler
Frontend: user-space; approximate shortest remaining process time (SRPT) priority scheduling by adaptively learning from an SRPT simulation on recent past workload.
Backend: use eBPF functions hooked to CFS to inform scheduling decisions (from the frontend) in the kernel.
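A minimal sketch of the frontend idea, with hypothetical names: learn each function's expected runtime from recent invocations and give shorter expected jobs higher priority, approximating SRPT. The eBPF backend that applies these priorities inside CFS is not shown.

```python
"""Approximate-SRPT priorities learned from recent history (names mine)."""
import statistics
from collections import defaultdict, deque

history = defaultdict(lambda: deque(maxlen=100))  # recent runtimes per function

def record(func, runtime_ms):
    history[func].append(runtime_ms)

def priority(func, default_ms=50.0):
    """Lower value = scheduled sooner, mimicking shortest-remaining-time."""
    runs = history[func]
    return statistics.median(runs) if runs else default_ms

for r in (3, 4, 5):
    record("thumbnail", r)
for r in (400, 420):
    record("video_transcode", r)

queue = sorted(["video_transcode", "thumbnail"], key=priority)
print(queue)  # short function first: ['thumbnail', 'video_transcode']
```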
StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow [] []
HUST & INRIA
One GPU runtime per inference workflow instead of one GPU runtime per function.
Use CUDA streams for serverless inference; fine-grained GPU memory management; and PCIe bandwidth sharing among concurrent streams.
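A hedged sketch of the one-runtime, many-streams idea using PyTorch's stream API (assumes a CUDA GPU; the fine-grained memory pooling and PCIe bandwidth sharing are not shown): two "functions" of a workflow share one process and run concurrently on separate CUDA streams.

```python
"""Two workflow functions sharing one GPU runtime via CUDA streams."""
import torch

assert torch.cuda.is_available()
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
x = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(s1):   # "function A" of the inference workflow
    a = x @ x
with torch.cuda.stream(s2):   # "function B" can overlap with A
    b = x + 1
torch.cuda.synchronize()      # wait for both streams to finish
print(a.shape, b.shape)
```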
A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND [] []
Sungkyunkwan University & Yonsei University & Seoul National University
Enhance performance while maintaining strict data isolation between requests.
Reset the container to an initial state free of any sensitive data after each function request; incorporate a kernel-level memory snapshot management system; optimize runtime by reusing memory regions and leveraging the temporal locality of function executions.
Power-aware Deep Learning Model Serving with μ-Serve []
UIUC & IBM Research
Scale GPU frequency down for power saving without violating SLO attainment.
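The policy reduces to a simple search, sketched below with a fabricated frequency table and a naive latency-inversely-proportional-to-frequency model: pick the lowest GPU frequency whose predicted latency still meets the SLO with headroom.

```python
"""Toy power-aware frequency policy (fabricated model and numbers)."""

FREQS_MHZ = [900, 1100, 1300, 1500]
SLO_MS = 120.0

def predicted_latency_ms(base_ms_at_max, freq):
    return base_ms_at_max * (FREQS_MHZ[-1] / freq)  # latency ~ 1/frequency

def choose_frequency(base_ms_at_max, headroom=0.9):
    for f in FREQS_MHZ:  # lowest first: most power saving
        if predicted_latency_ms(base_ms_at_max, f) <= SLO_MS * headroom:
            return f
    return FREQS_MHZ[-1]  # no savings possible without risking the SLO

print(choose_frequency(base_ms_at_max=60.0))   # light load -> low frequency
print(choose_frequency(base_ms_at_max=105.0))  # near SLO -> max frequency
```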
Starburst: A Cost-aware Scheduler for Hybrid Cloud [] []
UC Berkeley & UCSB
Distinguished Artifact Award
Run batch workloads on private clusters or the public cloud, trading off cost against job completion time (JCT).
Dynamically control jobs' waiting times to improve utilization.
Assign longer waits for large jobs to increase their chances of running on the cluster.
Assign shorter waits to small jobs to increase their chances of running on the cloud.
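A minimal sketch of the waiting-time policy, with an invented linear wait budget: jobs wait for cluster capacity in proportion to their size before spilling to the cloud, so large (expensive-to-run-in-cloud) jobs wait longer and small jobs leave quickly.

```python
"""Size-proportional wait budgets before spilling to the cloud (toy)."""

def wait_budget_s(gpu_hours, base_s=30.0):
    return base_s * gpu_hours  # larger jobs get proportionally longer waits

def place(job_gpu_hours, waited_s, cluster_has_room):
    if cluster_has_room:
        return "cluster"
    if waited_s < wait_budget_s(job_gpu_hours):
        return "keep waiting"
    return "cloud"  # stop waiting and pay for the cloud to protect JCT

print(place(job_gpu_hours=0.5, waited_s=20, cluster_has_room=False))  # cloud
print(place(job_gpu_hours=64, waited_s=20, cluster_has_room=False))   # keep waiting
```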
MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States [] []
THU
Generate more complete operator graphs by collecting key runtime information through monitoring program execution.
Provide a reference graph to record program execution states and leverage reference relationships to identify state changes that can impact program outputs.
OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model []
UCSD & UCSB & Meta & Pacific Northwest National Laboratory
Provide a near-optimal parallelization strategy for embedding tables.
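For intuition, the sketch below shards embedding tables with a greedy longest-processing-time heuristic that balances per-GPU load. The table costs are invented, and OPER's plan is optimality-guided rather than this simple greedy.

```python
"""Greedy cost-balanced embedding-table sharding (illustrative only)."""
import heapq

table_cost = {"user": 8.0, "item": 6.0, "ad": 5.0, "query": 4.0, "geo": 2.0}

def shard(num_gpus):
    heap = [(0.0, g, []) for g in range(num_gpus)]  # (load, gpu, tables)
    heapq.heapify(heap)
    # place the most expensive tables first, each on the least-loaded GPU
    for name, cost in sorted(table_cost.items(), key=lambda kv: -kv[1]):
        load, gpu, tables = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, gpu, tables + [name]))
    return sorted(heap, key=lambda e: e[1])

for load, gpu, tables in shard(2):
    print(f"GPU{gpu}: {tables} (load {load})")
```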
Fast Inference for Probabilistic Graphical Models [] []
University of Western Australia & HKUST
Fast-PGM: a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms.
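A tiny likelihood-weighting example (a form of importance sampling) of the kind of inference Fast-PGM accelerates: estimate P(A=1 | B=1) in a two-node Bayesian network by weighting prior samples by the evidence likelihood. The network and probabilities are mine; Fast-PGM's contribution is making such sampling fast and parallel.

```python
"""Likelihood weighting on a toy two-node Bayesian network."""
import random

random.seed(0)
P_A = 0.3                       # P(A=1)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}  # P(B=1 | A)

def estimate_p_a_given_b1(n=100_000):
    """Sample non-evidence variables from the prior; weight each sample
    by the probability of the evidence (B=1)."""
    num = den = 0.0
    for _ in range(n):
        a = 1 if random.random() < P_A else 0
        w = P_B_GIVEN_A[a]
        num += w * a
        den += w
    return num / den

# Exact answer: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) = 0.27/0.41 ~ 0.659
print(round(estimate_p_a_given_b1(), 3))
```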
PeRF: Preemption-enabled RDMA Framework []
Acryl Inc. & Sungkyunkwan University
Offer software-based performance isolation for efficient multi-tenancy in RDMA.
HydraRPC: RPC in the CXL Era []
Alibaba & THU & ZJU & PKU
Utilize CXL-attached HDM (host-managed device memory) to build RPC systems.
FastCommit: Resource-Efficient, Performant and Cost-Effective File System Journaling
Best Paper Award
An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise [] []
BUPT & UESTC