ATC 2024

Meta Info

Homepage: https://www.usenix.org/conference/atc24

Paper list: https://www.usenix.org/conference/atc24/technical-sessions

Papers

Large Language Models (LLMs)

  • Serving LLMs

    • Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Paper]

      • NUS & SJTU & Huawei Cloud

      • Reuse KV caches across multi-turn conversations; maintain a hierarchical KV caching system; layer-wise pre-loading and asynchronous saving; scheduler-aware fetching and eviction (see the sketch below).
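
A minimal sketch of the core idea, cross-turn KV-cache reuse with a two-tier (GPU/host) store. All names and the eviction policy are illustrative assumptions, not CachedAttention's actual design:

```python
# Toy cross-turn KV-cache reuse (illustrative; not CachedAttention's code).
# A session's KV cache is saved when a turn ends and re-attached on the
# next turn, so prefill covers only the tokens not seen before.

class KVCacheStore:
    """Two-tier store: fast GPU memory ('hbm') backed by host memory."""

    def __init__(self, hbm_capacity=2):
        self.hbm, self.host = {}, {}
        self.hbm_capacity = hbm_capacity

    def save(self, session_id, kv):
        if len(self.hbm) >= self.hbm_capacity:   # spill the oldest entry to host
            victim = next(iter(self.hbm))        # dicts preserve insertion order
            self.host[victim] = self.hbm.pop(victim)
        self.hbm[session_id] = kv

    def load(self, session_id):
        if session_id in self.hbm:
            return self.hbm.pop(session_id)
        return self.host.pop(session_id, None)   # fetch back from the host tier

def serve_turn(store, session_id, history_len, new_tokens):
    kv = store.load(session_id) or {"tokens": 0}      # cold start: empty cache
    prefill = history_len + new_tokens - kv["tokens"] # only the uncached suffix
    kv["tokens"] = history_len + new_tokens
    store.save(session_id, kv)
    return prefill

store = KVCacheStore()
print(serve_turn(store, "s1", 0, 16))   # turn 1: prefill all 16 tokens
print(serve_turn(store, "s1", 16, 8))   # turn 2: prefill only the 8 new tokens
```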

    • Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs [Paper] [Code]

      • Sydney & Microsoft & Rutgers

      • TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization (e.g., 5-bit); see the bit-packing sketch below.
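
The systems difficulty behind FP6 serving is that 6-bit values do not align with byte or word boundaries. A hedged illustration of the bit-packing arithmetic (four 6-bit weights fit exactly in three bytes); the real TC-FPx kernels do the analogous unpacking in GPU registers ahead of Tensor Core MMA, which this Python sketch does not attempt:

```python
import random

# Pack/unpack unsigned 6-bit values: 4 values -> 3 bytes (24 bits).

def pack_6bit(vals):
    assert len(vals) % 4 == 0 and all(0 <= v < 64 for v in vals)
    out = bytearray()
    for a, b, c, d in zip(*[iter(vals)] * 4):
        bits = (a << 18) | (b << 12) | (c << 6) | d          # 24 bits
        out += bytes([(bits >> 16) & 0xFF, (bits >> 8) & 0xFF, bits & 0xFF])
    return bytes(out)

def unpack_6bit(buf):
    vals = []
    for i in range(0, len(buf), 3):
        bits = (buf[i] << 16) | (buf[i + 1] << 8) | buf[i + 2]
        vals += [(bits >> 18) & 63, (bits >> 12) & 63, (bits >> 6) & 63, bits & 63]
    return vals

w = [random.randrange(64) for _ in range(8)]
packed = pack_6bit(w)                    # 8 weights -> 6 bytes (vs. 8 at int8)
assert unpack_6bit(packed) == w and len(packed) == 6
```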

  • LLM alignment / RLHF training

    • PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch [Paper]

      • THU

      • Intra-stage switching: explore model affinities and overlap computation via time-sharing.

      • Inter-stage switching: find the optimal switch plan with the minimum communication cost.

      • Based on Megatron-LM.

  • LLM federated fine-tuning

    • FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences [Paper] [Code]

      • BUPT

      • Employ backpropagation (BP)-free training methods, requiring devices only to execute “perturbed inferences”; adaptively allocate computational loads across devices to balance convergence speed and accuracy (see the sketch below).
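
A generic zeroth-order sketch of "perturbed inference" training on a least-squares toy model: devices only run forward passes at randomly perturbed weights, and the resulting directional derivatives combine into a gradient estimate. This is not FwdLLM's exact estimator, variance control, or device-allocation logic:

```python
import numpy as np

def loss(w, X, y):                                   # inference only, no BP
    return np.mean((X @ w - y) ** 2)

def perturbed_gradient(w, X, y, n_perturb=64, eps=1e-4):
    base = loss(w, X, y)
    g = np.zeros_like(w)
    for _ in range(n_perturb):                       # could be spread over devices
        v = np.random.randn(*w.shape)                # random perturbation direction
        d = (loss(w + eps * v, X, y) - base) / eps   # directional derivative
        g += d * v                                   # project back onto v
    return g / n_perturb

np.random.seed(0)
X = np.random.randn(128, 8)
y = X @ np.random.randn(8)
w = np.zeros(8)
for _ in range(300):
    w -= 0.05 * perturbed_gradient(w, X, y)
print("final loss:", loss(w, X, y))                  # approaches 0
```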

  • LLM training

    • Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper] [Code]

      • Kuaishou

      • Balance computation and memory utilization.

      • Two activation rematerialization strategies:

        • Pipeline-parallel-aware offloading to maximize the utilization of host memory for storing activations.

        • Compute-memory balanced checkpointing to balance activation memory against computational efficiency (see the checkpointing sketch below).
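
A minimal PyTorch illustration of activation rematerialization via generic checkpointing; the paper's actual contributions (pipeline-parallel-aware offloading to host memory and compute-memory balanced checkpoint selection) are not shown:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Checkpointed blocks drop their intermediate activations in the forward
# pass and recompute them during backward, trading compute for memory.

blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.GELU())
     for _ in range(4)]
)

def forward(x, rematerialize=True):
    for blk in blocks:
        # checkpointed: only the block input is kept, not its activations
        x = checkpoint(blk, x, use_reentrant=False) if rematerialize else blk(x)
    return x

x = torch.randn(32, 256, requires_grad=True)
forward(x).sum().backward()                 # blocks are re-run here
print(x.grad.shape)                         # torch.Size([32, 256])
```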

Reliability

  • AI Infra

    • SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation [Paper] [Code]

      • MSR & Microsoft

      • Best Paper Award

      • SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation (i.e., gray failure) caused by hardware redundancies and enhances overall reliability.

      • A comprehensive benchmark suite that evaluates individual hardware components and represents most real AI workloads.

  • HBM

    • Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field [Paper] [Code]

      • Xiamen University & Huawei & Minjiang University

      • Conduct the first systematic field study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment across a variety of services.

      • Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures.

Supercomputer

  • Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?

    • THU & SDU & National Supercomputer Center in Wuxi

    • A comprehensive analysis of six years of data (40 TB in total, comprising I/O performance data and job running information) from Sunway TaihuLight, a supercomputer with 41,508 nodes.

    • Note: the data is currently not publicly available.

Distributed Training

  • Metis: Fast Automatic Distributed Training on Heterogeneous GPUs [Paper]

    • Samsung Research & UNIST

    • Metis, a system that automatically finds efficient parallelism plans for distributed training on heterogeneous GPUs.

    • Balance loads with heterogeneity-awareness; prefer data parallelism over tensor parallelism within a pipeline stage.

    • Evaluated with three large models (GPT-3, MoE, and Wide-ResNet).

Data Preprocessing

  • Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement [Paper] [Code]

    • ETH & Google

    • Dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth.

    • Analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput (see the reordering sketch below).
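
A toy version of the reordering idea: within each run of reorderable transformations, data-shrinking ops are moved earlier so later ops touch fewer bytes. The selectivity numbers and reorderability flags are assumptions here; Pecan derives legal reorderings automatically:

```python
from dataclasses import dataclass

@dataclass
class Transform:
    name: str
    selectivity: float       # output bytes / input bytes
    reorderable: bool        # safe to move relative to its neighbors?

pipeline = [
    Transform("decode_image", 5.0, False),   # barrier: must stay in place
    Transform("augment", 1.0, True),
    Transform("crop", 0.25, True),
    Transform("normalize", 1.0, True),
]

def reorder(pipeline):
    out, run = [], []
    for t in pipeline + [None]:              # None acts as a trailing barrier
        if t is not None and t.reorderable:
            run.append(t)
            continue
        out += sorted(run, key=lambda x: x.selectivity)   # shrinkers first
        run = []
        if t is not None:
            out.append(t)
    return out

print([t.name for t in reorder(pipeline)])
# ['decode_image', 'crop', 'augment', 'normalize']
```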

Serverless Computing

  • Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu [Paper]

    • SJTU IPADS & Huawei Cloud & EPFL

    • Jiagu, a serverless system based on OpenFaaS

      • Pre-decision scheduling: decouple prediction from decision-making; predict each function's capacity on a server with a learned model.

      • Dual-staged scaling: frequent adjustment of instance counts.

  • ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions [Paper] [Code]

    • UVA & George Mason University & Adobe Research

    • ALPS: Adaptive Learning, Priority Scheduler

      • Application-aware kernel scheduler

      • Frontend: user-space; approximate shortest remaining process time (SRPT) priority scheduling by adaptively learning from an SRPT simulation on recent past workload.

      • Backend: kernel space; apply the frontend's scheduling decisions via eBPF programs hooked into CFS (see the sketch below).
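
A toy user-space approximation of SRPT from recent execution history, in the spirit of the frontend; ALPS's actual SRPT simulation and its eBPF/CFS backend are not modeled here, and the function names and window size are illustrative:

```python
import random
from collections import defaultdict

history = defaultdict(list)                  # function name -> recent runtimes (s)

def record(fn, runtime, window=100):
    history[fn].append(runtime)
    del history[fn][:-window]                # keep a sliding window

def remaining_time(fn, elapsed):
    h = sorted(history[fn]) or [1.0]         # default guess for unseen functions
    p95 = h[int(0.95 * (len(h) - 1))]        # pessimistic total-runtime estimate
    return max(p95 - elapsed, 0.0)           # approximate remaining time

random.seed(0)
for _ in range(200):
    record("thumbnail", random.uniform(0.05, 0.15))
    record("video_transcode", random.uniform(5.0, 20.0))

ready = [("video_transcode", 2.0), ("thumbnail", 0.0)]   # (function, elapsed s)
ready.sort(key=lambda inv: remaining_time(*inv))         # shortest remaining first
print([fn for fn, _ in ready])               # ['thumbnail', 'video_transcode']
```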

  • StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow [Paper] [Code]

    • HUST & INRIA

    • One GPU runtime per inference workflow instead of one GPU runtime per function.

    • Use CUDA streams for serverless inference; fine-grained GPU memory management; PCIe bandwidth sharing among concurrent streams (see the streams sketch below).
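
A minimal illustration of two "functions" sharing one GPU runtime as separate CUDA streams, using standard PyTorch stream APIs; StreamBox's sandboxing, memory management, and PCIe bandwidth sharing are layered on top of this basic idea:

```python
import torch

def infer(x, weight):                        # stand-in for a model forward pass
    return (x @ weight).relu()

if torch.cuda.is_available():
    xs = [torch.randn(1024, 1024, device="cuda") for _ in range(2)]
    w = torch.randn(1024, 1024, device="cuda")
    streams = [torch.cuda.Stream() for _ in range(2)]
    outs = [None, None]
    for i, (x, s) in enumerate(zip(xs, streams)):
        with torch.cuda.stream(s):           # kernels below issue on stream s
            outs[i] = infer(x, w)
    torch.cuda.synchronize()                 # wait for both streams to finish
    print([tuple(o.shape) for o in outs])
```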

  • A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND [Paper] [Code]

    • Sungkyunkwan University & Yonsei University & Seoul National University

    • Enhance performance while maintaining strict data isolation between requests.

    • After each function request, the container is rewound to an initial state free of any sensitive data.

    • Incorporate a kernel-level memory snapshot management system.

    • Optimize runtime performance by reusing memory regions and leveraging the temporal locality of function executions.

Model Serving

  • Power-aware Deep Learning Model Serving with μ-Serve [Paper]

    • UIUC & IBM Research

    • Scale GPU frequency down to save power without violating SLO attainment (see the frequency-selection sketch below).
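
A toy frequency-selection policy capturing the idea: pick the lowest-power GPU clock whose predicted latency still meets the SLO. The profile numbers are invented for illustration, and this is not μ-Serve's algorithm; actually applying the chosen clock is left out:

```python
SLO_MS = 50.0

profile = [        # (core clock MHz, predicted p99 latency ms, power W) -- assumed
    (900, 62.0, 180),
    (1200, 48.0, 230),
    (1410, 41.0, 280),
    (1800, 35.0, 350),
]

feasible = [entry for entry in profile if entry[1] <= SLO_MS]
freq, lat, power = min(feasible, key=lambda e: e[2])   # min power, SLO still met
print(f"lock clocks at {freq} MHz: p99={lat} ms, power={power} W")
```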

Cluster Scheduler

  • Starburst: A Cost-aware Scheduler for Hybrid Cloud [Paper] [Code]

    • UC Berkeley & UCSB

    • Distinguished Artifact Award

    • Run batch workloads on private clusters or the public cloud, trading off cost against job completion time (JCT).

    • Dynamically control jobs' waiting times to improve utilization.

      • Assign longer waits for large jobs to increase their chances of running on the cluster.

      • Assign shorter waits to small jobs to increase their chances of running on the cloud (see the policy sketch below).
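
A toy waiting-budget policy illustrating the idea: a job's patience scales with its resource footprint, so big jobs hold out for the cheap private cluster while small jobs spill to the cloud quickly. The budget constant and the linear budget form are assumptions, not Starburst's tuned policy:

```python
WAIT_HOURS_PER_GPU_HOUR = 0.25     # assumed patience coefficient

def waiting_budget(num_gpus, est_runtime_h):
    return WAIT_HOURS_PER_GPU_HOUR * num_gpus * est_runtime_h

def place(waited_h, num_gpus, est_runtime_h, cluster_has_room):
    if cluster_has_room:
        return "private cluster"
    if waited_h < waiting_budget(num_gpus, est_runtime_h):
        return "keep waiting"
    return "public cloud"

print(place(0.5, 64, 8.0, cluster_has_room=False))   # big job -> keep waiting
print(place(0.5, 1, 0.5, cluster_has_room=False))    # small job -> public cloud
```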

Deep Learning Compiler

  • MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States [Paper] [Code]

    • THU

    • Generate more complete operator graphs by collecting key runtime information through monitoring program execution.

    • Provide a reference graph to record program execution states and leverage reference relationships to identify state changes that can impact program outputs.

Deep Learning Recommendation Models (DLRMs)

  • OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model [Paper]

    • UCSD & UCSB & Meta & Pacific Northwest National Laboratory

    • Provide a near-optimal parallelization strategy for embedding tables (a greedy baseline is sketched below for contrast).
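
For contrast with OPER's optimality-guided approach, a classic greedy longest-processing-time (LPT) baseline for placing embedding tables across GPUs, with made-up table loads:

```python
import heapq

# Always give the next-heaviest table to the currently least-loaded GPU.

def shard_tables(table_loads, num_gpus):
    heap = [(0.0, gpu, []) for gpu in range(num_gpus)]    # (load, id, tables)
    heapq.heapify(heap)
    for table, load in sorted(table_loads.items(), key=lambda kv: -kv[1]):
        total, gpu, tables = heapq.heappop(heap)          # least-loaded GPU
        heapq.heappush(heap, (total + load, gpu, tables + [table]))
    return sorted(heap, key=lambda entry: entry[1])       # order by GPU id

loads = {"user_id": 9.0, "item_id": 7.5, "category": 1.2, "geo": 0.8}
for load, gpu, tables in shard_tables(loads, 2):
    print(f"GPU{gpu}: {tables} (load {load})")
# GPU0: ['user_id'] (load 9.0)
# GPU1: ['item_id', 'category', 'geo'] (load 9.5)
```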

Probabilistic Graphical Models

  • Fast Inference for Probabilistic Graphical Models [Paper] [Code]

    • University of Western Australia & HKUST

    • Fast-PGM, a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms (see the sketch below).
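
A textbook likelihood-weighting example on a two-node Bayesian network (Rain -> WetGrass), showing why this style of inference parallelizes well: samples are independent. This is not Fast-PGM's engine:

```python
import random

P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.3}

def p_rain_given_wet(n_samples=100_000, seed=0):
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_samples):
        rain = rng.random() < P_RAIN     # sample the non-evidence variable
        w = P_WET_GIVEN_RAIN[rain]       # weight: P(WetGrass=1 | Rain=rain)
        num += w * rain
        den += w
    return num / den

# Exact posterior: 0.2*0.9 / (0.2*0.9 + 0.8*0.3) = 0.4286
print(p_rain_given_wet())
```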

Remote Direct Memory Access (RDMA)

  • PeRF: Preemption-enabled RDMA Framework [Paper]

    • Acryl Inc. & Sungkyunkwan University

    • Offer software-based performance isolation for efficient multi-tenancy in RDMA.

Remote Procedure Call (RPC)

  • HydraRPC: RPC in the CXL Era [Paper]

    • Alibaba & THU & ZJU & PKU

    • Utilize CXL-attached host-managed device memory (HDM) to build RPC systems (see the shared-memory sketch below).
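
A toy RPC exchange over shared memory, with an anonymous mmap standing in for CXL-attached HDM (which the host addresses like ordinary memory): a request/response needs no NIC, DMA setup, or serialization stack. HydraRPC's actual data layout, notification mechanism, and cache-coherence handling are not modeled:

```python
import mmap, struct, threading

REQ, RESP = 1, 2
buf = mmap.mmap(-1, 4096)                   # layout: [state:4B][len:4B][payload]

def server():
    while struct.unpack_from("i", buf, 0)[0] != REQ:
        pass                                # poll for an incoming request
    n = struct.unpack_from("i", buf, 4)[0]
    reply = bytes(buf[8:8 + n]).upper()     # the "remote" procedure
    buf[8:8 + len(reply)] = reply
    struct.pack_into("ii", buf, 0, RESP, len(reply))

t = threading.Thread(target=server)
t.start()

payload = b"hello cxl"
buf[8:8 + len(payload)] = payload                    # write args in place
struct.pack_into("ii", buf, 0, REQ, len(payload))    # publish the request
while struct.unpack_from("i", buf, 0)[0] != RESP:
    pass                                             # poll for the response
n = struct.unpack_from("i", buf, 4)[0]
print(bytes(buf[8:8 + n]).decode())                  # HELLO CXL
t.join()
```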

Journaling File System

  • FastCommit: Resource-Efficient, Performant and Cost-Effective File System Journaling

    • Google

    • Best Paper Award

Rust-for-Linux

  • An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise [Paper] [Code]

    • BUPT & UESTC
