ATC 2024
Meta Info
Homepage: https://www.usenix.org/conference/atc24
Paper list: https://www.usenix.org/conference/atc24/technical-sessions
Papers
Large Language Models (LLMs)
Serving LLMs
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention [Paper]
NUS & SJTU & Huawei Cloud
Reuse KV caches across multi-turn conversations; maintain a hierarchical KV caching system; layer-wise pre-loading and asynchronous saving; scheduler-aware fetching and eviction.
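A minimal sketch of the KV-reuse idea, with a two-tier dict standing in for the paper's hierarchical GPU/host/disk cache (all names hypothetical; layer-wise pre-loading and asynchronous saving are reduced to plain lookups and stores here):

```python
"""Toy two-tier KV cache keyed by conversation ID. Hypothetical names;
a real system stores per-layer tensors in GPU HBM with host-memory and
disk tiers, and overlaps loading/saving with computation."""
from collections import OrderedDict

class HierarchicalKVCache:
    def __init__(self, fast_capacity=2):
        self.fast = OrderedDict()  # stands in for GPU HBM
        self.slow = {}             # stands in for host DRAM / disk
        self.fast_capacity = fast_capacity

    def save(self, conv_id, kv):
        """The paper saves asynchronously; this is synchronous."""
        self.fast[conv_id] = kv
        self.fast.move_to_end(conv_id)
        while len(self.fast) > self.fast_capacity:
            victim, tensors = self.fast.popitem(last=False)  # evict LRU
            self.slow[victim] = tensors                      # spill down a tier

    def load(self, conv_id):
        """The paper pre-loads layer by layer; this is a plain lookup."""
        if conv_id in self.fast:
            return self.fast[conv_id]      # hit in the fast tier
        if conv_id in self.slow:
            kv = self.slow.pop(conv_id)    # promote from the slow tier
            self.save(conv_id, kv)
            return kv
        return None                        # first turn: must prefill from scratch

cache = HierarchicalKVCache()
cache.save("conv-1", ["kv-layer-0", "kv-layer-1"])
cache.save("conv-2", ["kv-layer-0"])
cache.save("conv-3", ["kv-layer-0"])       # evicts conv-1 to the slow tier
assert cache.load("conv-1") is not None    # next turn reuses, not recomputes
```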
Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs [Paper] [Code]
Sydney & Microsoft & Rutgers
TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary-bit-width quantization (e.g., 5-bit).
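For intuition, a toy FP6 quantizer that enumerates all 64 codes of an assumed 1-sign/3-exponent/2-mantissa (E3M2) layout and rounds to the nearest representable value. The exponent bias and the use of every exponent code are assumptions; the paper's actual contribution is the TC-FPx GPU kernel, not this rounding math:

```python
"""Toy FP6 (assumed E3M2) quantizer: build the table of all 64
representable values, then round each weight to the nearest one."""
import itertools

BIAS = 3  # assumed exponent bias for the E3M2 layout

def fp6_values():
    vals = set()
    for sign, exp, man in itertools.product(range(2), range(8), range(4)):
        if exp == 0:                      # subnormals: no implicit leading 1
            mag = (man / 4) * 2 ** (1 - BIAS)
        else:                             # normals: implicit leading 1
            mag = (1 + man / 4) * 2 ** (exp - BIAS)
        vals.add(-mag if sign else mag)
    return sorted(vals)

TABLE = fp6_values()

def quantize(x):
    return min(TABLE, key=lambda v: abs(v - x))  # nearest representable value

for w in [0.1, -0.33, 1.7, 9.0]:
    print(f"{w:+.3f} -> {quantize(w):+.4f}")
```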
LLM alignment / RLHF training
PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch [Paper]
THU
Intra-stage switching: explore model affinities and overlap computation via time-sharing.
Inter-stage switching: find the optimal switch plan with the minimum communication cost.
Based on Megatron-LM.
LLM federated fine-tuning
FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences [Paper] [Code]
BUPT
Employ backpropagation (BP)-free training methods that require devices only to execute “perturbed inferences”; adaptively allocate computational loads across devices to balance convergence speed and accuracy.
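A sketch of BP-free training with perturbed inferences, using a generic SPSA-style two-point estimator; whether this matches FwdLLM's exact estimator is not claimed:

```python
"""BP-free gradient estimation via perturbed inferences: devices only
run forward passes on randomly perturbed weights. Generic SPSA-style
estimator on a stand-in loss, not necessarily the paper's formulation."""
import random

def loss(theta):
    # Stand-in model: f(theta) = (theta0 - 3)^2 + (theta1 + 1)^2
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2

def spsa_step(theta, lr=0.1, eps=1e-3):
    v = [random.choice([-1.0, 1.0]) for _ in theta]     # random direction
    plus = [t + eps * vi for t, vi in zip(theta, v)]    # perturbed inference 1
    minus = [t - eps * vi for t, vi in zip(theta, v)]   # perturbed inference 2
    slope = (loss(plus) - loss(minus)) / (2 * eps)      # directional derivative
    return [t - lr * slope * vi for t, vi in zip(theta, v)]

theta = [0.0, 0.0]
for _ in range(500):
    theta = spsa_step(theta)
print(theta)  # approaches [3, -1] without any backpropagation
```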
LLM training
Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [Paper] [Code]
Kuaishou
The balance between computation and memory utilization.
Two activation rematerialization strategies
Pipeline-parallel-aware offloading to maximize the utilization of host memory for storing activations.
Compute-memory balanced checkpointing to balance between activation memory and computational efficiency.
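As a toy version of the compute-memory balance, the sketch below greedily keeps the activations that are most expensive to recompute per byte under a memory budget; the greedy policy and all numbers are illustrative, not the paper's planner:

```python
"""Toy compute-memory balanced checkpointing: under a memory budget,
store the activations whose recomputation is costliest per byte and
recompute the rest during the backward pass."""

def plan_checkpoints(layers, mem_budget):
    # layers: list of (name, activation_bytes, recompute_cost)
    ranked = sorted(layers, key=lambda l: l[2] / l[1], reverse=True)
    kept, used, recompute_cost = [], 0, 0
    for name, size, cost in ranked:
        if used + size <= mem_budget:
            kept.append(name)          # store: pay memory, save compute
            used += size
        else:
            recompute_cost += cost     # drop: pay recompute in the backward pass
    return kept, used, recompute_cost

layers = [("attn", 4, 10), ("mlp", 8, 6), ("norm", 1, 1), ("embed", 6, 9)]
kept, used, extra = plan_checkpoints(layers, mem_budget=10)
print(kept, used, extra)  # keeps 'attn' and 'embed', recomputes the rest
```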
Reliability
AI Infra
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation [Paper] [Code]
MSR & Microsoft
Best Paper Award
SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation (i.e., gray failure) caused by hardware redundancies and enhances overall reliability.
A comprehensive benchmark suite to evaluate individual hardware components and represent most real AI workloads.
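A minimal sketch of proactive validation: benchmark each node and flag outliers against the fleet median, catching machines that pass liveness checks but silently underperform (scores and threshold are made up):

```python
"""Toy proactive validation: flag nodes whose benchmark score falls
well below the fleet median, i.e., hidden (gray) degradation rather
than a hard failure. SuperBench's real suite covers many components."""
from statistics import median

def validate(scores, tolerance=0.10):
    baseline = median(scores.values())
    return [node for node, s in scores.items()
            if s < baseline * (1 - tolerance)]   # degraded, not dead

gemm_tflops = {"node1": 310, "node2": 312, "node3": 244, "node4": 309}
print(validate(gemm_tflops))  # ['node3']: healthy-looking but underperforming
```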
HBM
Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field [Paper] [Code]
Xiamen University & Huawei & Minjiang University
Conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services.
Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures.
Supercomputer
Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?
THU & SDU & National Supercomputer Center in Wuxi
A comprehensive analysis of six years of data (40 TB, comprising I/O performance data and job running information) from Sunway TaihuLight, which has 41,508 nodes.
Notice: The data is currently not available.
Distributed Training
Metis: Fast Automatic Distributed Training on Heterogeneous GPUs [Paper]
Samsung Research & UNIST
Metis, a system that automatically finds efficient parallelism plans for distributed training on heterogeneous GPUs.
Balance loads with heterogeneity-awareness; prefer data parallelism over tensor parallelism within a pipeline stage.
Evaluated with three large models (GPT-3, MoE, and Wide-Resnet).
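One ingredient of heterogeneity-aware load balancing, sketched below: split the global batch across GPUs in proportion to measured throughput so all replicas finish a step together (throughput numbers are illustrative; Metis's planner searches a much larger space):

```python
"""Toy heterogeneity-aware data-parallel load balancing: size each
replica's batch by its GPU's measured throughput."""

def split_batch(global_batch, throughput):
    total = sum(throughput.values())
    shares = {g: round(global_batch * t / total) for g, t in throughput.items()}
    # Fix rounding drift so shares sum exactly to the global batch.
    drift = global_batch - sum(shares.values())
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += drift
    return shares

gpus = {"A100": 100.0, "V100": 55.0, "T4": 25.0}  # samples/sec, illustrative
print(split_batch(512, gpus))  # {'A100': 285, 'V100': 156, 'T4': 71}
```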
Data Preprocessing
Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement [Paper] [Code]
ETH & Google
Dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth.
Analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput.
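A toy version of automatic transformation reordering: among ops marked order-independent, run cheap, size-reducing ones first so expensive ops see fewer records; the cost/selectivity numbers and reorderability flags are illustrative assumptions:

```python
"""Toy transformation reordering: sort reorderable ops so cheap,
highly-selective ones run first, keeping non-reorderable ops in place."""

def reorder(pipeline):
    # pipeline: list of (name, cost_per_record, selectivity, reorderable)
    fixed = [(i, op) for i, op in enumerate(pipeline) if not op[3]]
    movable = [op for op in pipeline if op[3]]
    movable.sort(key=lambda op: op[1] * op[2])  # cheap + selective first
    result = movable
    for i, op in fixed:                # restore non-reorderable ops in place
        result.insert(i, op)
    return [op[0] for op in result]

pipeline = [
    ("decode_jpeg", 9.0, 1.0, False),  # must run first: produces the image
    ("augment",     5.0, 1.0, True),
    ("filter_bad",  0.5, 0.6, True),   # drops 40% of records
    ("resize",      2.0, 1.0, True),
]
print(reorder(pipeline))  # ['decode_jpeg', 'filter_bad', 'resize', 'augment']
```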
Serverless Computing
Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu [Paper]
SJTU IPADS & Huawei Cloud & EPFL
Jiagu, a serverless system based on OpenFaaS
Pre-decision scheduling: decouple prediction from decision-making; predict each function's capacity on each server with a model.
Dual-staged scaling: frequent adjustment of instances.
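A sketch of the pre-decision idea: capacity predictions are computed off the critical path and cached, so the scheduling decision itself is a cheap lookup; the capacity formula here is a made-up stand-in for Jiagu's learned model:

```python
"""Toy pre-decision scheduling: a slow prediction path populates a
cache, and the fast decision path only reads it."""

def predict_capacity(server, fn):
    # Slow path: stands in for a learned model of how many instances of
    # `fn` a server can host without interference.
    return max((server["free_mem_mb"] - fn["mem_mb"]) // fn["mem_mb"], 0)

servers = {"s1": {"free_mem_mb": 4096}, "s2": {"free_mem_mb": 1024}}
fn = {"name": "resize", "mem_mb": 512}
cache = {s: predict_capacity(cfg, fn) for s, cfg in servers.items()}  # pre-decided

def schedule(cache):
    # Fast path: pure lookup, no model inference on the critical path.
    best = max(cache, key=cache.get)
    return best if cache[best] > 0 else None

print(schedule(cache))  # 's1'
```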
ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions [Paper] [Code]
UVA & George Mason University & Adobe Research
ALPS: Adaptive Learning, Priority Scheduler
Application-aware kernel scheduler
Frontend: user-space; approximate shortest remaining process time (SRPT) priority scheduling by adaptively learning from an SRPT simulation on recent past workload.
Backend: use eBPF functions hooked to CFS to inform scheduling decisions (from the frontend) in the kernel.
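The scheduling logic, in miniature: estimate each invocation's remaining time from past runtimes of the same function and run the shortest first; the eBPF/CFS enforcement path is not modeled, and the history numbers are invented:

```python
"""Toy SRPT approximation: predict remaining time from per-function
runtime history and pick the invocation with the least time left."""
from statistics import median

history = {                     # past runtimes per function (seconds)
    "thumbnail": [0.1, 0.12, 0.09],
    "video_encode": [4.8, 5.1, 5.3],
}

def remaining(fn, elapsed):
    return max(median(history[fn]) - elapsed, 0.0)  # predicted time left

def pick_next(ready):
    # ready: list of (function_name, elapsed_seconds)
    return min(ready, key=lambda r: remaining(*r))

ready = [("video_encode", 5.05), ("thumbnail", 0.0), ("video_encode", 0.0)]
print(pick_next(ready))  # ('video_encode', 5.05): almost done, so it runs first
```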
StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow [Paper] [Code]
HUST & INRIA
One GPU runtime per inference workflow instead of one GPU runtime per function.
Use CUDA streams for serverless inference; fine-grained GPU memory management; and PCIe bandwidth sharing among concurrent streams.
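A minimal sketch of the one-runtime-many-streams direction, using PyTorch's CUDA stream API as a stand-in for the paper's runtime (requires a CUDA GPU; StreamBox also manages GPU memory and PCIe bandwidth, which this omits):

```python
"""Two 'serverless functions' sharing one GPU runtime via separate
CUDA streams in a single process, instead of per-function contexts."""
import torch

def run_function(stream, size):
    with torch.cuda.stream(stream):   # enqueue on this function's stream
        x = torch.randn(size, size, device="cuda")
        return x @ x                  # asynchronous w.r.t. other streams

if torch.cuda.is_available():
    s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
    y1 = run_function(s1, 1024)       # "function A"
    y2 = run_function(s2, 1024)       # "function B": kernels may overlap
    torch.cuda.synchronize()          # wait for both streams to finish
    print(y1.shape, y2.shape)
```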
A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND [Paper] [Code]
Sungkyunkwan University & Yonsei University & Seoul National University
Enhance performance while maintaining strict data isolation between requests.
The container is reset to an initial state free of any sensitive data after each function request; incorporate a kernel-level memory snapshot management system; optimize runtime by reusing memory regions and leveraging the temporal locality of function executions.
Model Serving
Power-aware Deep Learning Model Serving with μ-Serve [Paper]
UIUC & IBM Research
Scaling GPU frequency for power saving without SLO attainment violations.
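A toy power-aware controller in this spirit: lower GPU frequency while p99 latency has headroom under the SLO, raise it when the SLO is threatened; the margins and step size are illustrative assumptions, not μ-Serve's policy:

```python
"""Toy frequency controller: trade latency headroom for power savings
while staying under the SLO."""

def next_frequency(freq_mhz, p99_ms, slo_ms, step=75, lo=900, hi=1980):
    if p99_ms > slo_ms * 0.95:           # too close to violating the SLO
        return min(freq_mhz + step, hi)  # speed up
    if p99_ms < slo_ms * 0.70:           # ample headroom
        return max(freq_mhz - step, lo)  # save power
    return freq_mhz                      # in the comfort band: hold

freq = 1980
for p99 in [40, 42, 48, 71, 69]:         # observed p99 latencies (ms)
    freq = next_frequency(freq, p99, slo_ms=75)
    print(p99, "->", freq)
```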
Cluster Scheduler
Starburst: A Cost-aware Scheduler for Hybrid Cloud [Paper] [Code]
UC Berkeley & UCSB
Distinguished Artifact Award
Run batch workloads on private clusters or the public cloud, trading off cost against job completion time (JCT).
Dynamically control jobs' waiting times to improve utilization.
Assign longer waits for large jobs to increase their chances of running on the cluster.
Assign shorter waits to small jobs to increase their chances of running on the cloud.
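A sketch of the waiting policy: give each job a wait budget that grows with its size, then spill to the cloud once the budget is exhausted; the linear rule and constants are illustrative assumptions:

```python
"""Toy cost-aware waiting: big jobs wait longer for cheap private
capacity, small jobs spill quickly to the cloud."""

def wait_budget(gpus_requested, base_s=60, per_gpu_s=90):
    return base_s + per_gpu_s * gpus_requested   # bigger job, longer wait

def place(job, waited_s, cluster_has_room):
    if cluster_has_room:
        return "private-cluster"                 # free capacity: always take it
    if waited_s < wait_budget(job["gpus"]):
        return "keep-waiting"
    return "public-cloud"                        # waited long enough: pay up

print(place({"gpus": 64}, waited_s=1800, cluster_has_room=False))  # keep-waiting
print(place({"gpus": 1},  waited_s=1800, cluster_has_room=False))  # public-cloud
```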
Deep Learning Compiler
MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States [Paper] [Code]
THU
Generate more complete operator graphs by collecting key runtime information through monitoring program execution.
Provide a reference graph to record program execution states and leverage reference relationships to identify state changes that can impact program outputs.
Deep Learning Recommendation Models (DLRMs)
OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model [Paper]
UCSD & UCSB & Meta & Pacific Northwest National Laboratory
Provide a near-optimal parallelization strategy for embedding tables.
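For intuition, a greedy baseline for the underlying problem: place the hottest embedding tables on the least-loaded device to balance lookup load (access rates invented; OPER's point is finding near-optimal plans, which greedy placement is not guaranteed to be):

```python
"""Toy embedding-table sharding: hottest tables go to the currently
least-loaded device."""
import heapq

def shard(tables, num_devices):
    # tables: list of (name, lookups_per_sec); heap of (load, device_id)
    heap = [(0.0, d) for d in range(num_devices)]
    placement = {}
    for name, rate in sorted(tables, key=lambda t: -t[1]):  # hottest first
        load, dev = heapq.heappop(heap)        # least-loaded device
        placement[name] = dev
        heapq.heappush(heap, (load + rate, dev))
    return placement

tables = [("user_id", 90.0), ("item_id", 80.0), ("geo", 30.0), ("ad", 25.0)]
print(shard(tables, num_devices=2))
# {'user_id': 0, 'item_id': 1, 'geo': 1, 'ad': 0}
```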
Probabilistic Graphical Models
Remote Direct Memory Access (RDMA)
PeRF: Preemption-enabled RDMA Framework [Paper]
Acryl Inc. & Sungkyunkwan University
Offer software-based performance isolation for efficient multi-tenancy in RDMA.
Remote Procedure Call (RPC)
HydraRPC: RPC in the CXL Era [Paper]
Alibaba & THU & ZJU & PKU
Utilize CXL-attached HDM to build RPC systems.
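A minimal sketch of RPC over a shared memory region, with POSIX shared memory standing in for CXL-attached HDM: the two sides exchange a message by reading and writing the same bytes, with no network stack; the length-prefix framing is an illustrative assumption:

```python
"""Toy shared-memory RPC: client and server communicate through one
memory region (a stand-in for CXL HDM) instead of the network."""
from multiprocessing import shared_memory
import struct

shm = shared_memory.SharedMemory(create=True, size=4096)

def send(buf, payload):
    buf[4:4 + len(payload)] = payload
    buf[0:4] = struct.pack("<I", len(payload))   # length-prefix framing

def recv(buf):
    (n,) = struct.unpack("<I", bytes(buf[0:4]))
    return bytes(buf[4:4 + n])

# "Client" writes a request; "server" reads it from the same region.
send(shm.buf, b"GET /user/42")
print(recv(shm.buf))  # b'GET /user/42': moved via shared memory only
shm.close()
shm.unlink()
```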
Journaling File System
FastCommit: Resource-Efficient, Performant and Cost-Effective File System Journaling
Google
Best Paper Award
Rust-for-Linux