ATC 2024
Homepage:
Paper list:
Acceptance rate: 15.8% (= 77 / 488)
Serving LLMs
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention []
NUS & SJTU & Huawei Cloud
Reuse KV caches across multi-turn conversations; maintain a hierarchical KV caching system; layer-wise pre-loading and asynchronous saving; scheduler-aware fetching and eviction.
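To make the caching idea concrete, here is a minimal Python sketch of a two-tier KV cache that evicts from a small "GPU" tier to host memory and promotes entries back on reuse. All class and method names are hypothetical; CachedAttention's layer-wise preloading and scheduler-aware fetching are only approximated by synchronous calls.

```python
"""Minimal sketch of a hierarchical KV cache (names and policy are mine,
not CachedAttention's actual implementation)."""
from collections import OrderedDict
import numpy as np

class HierarchicalKVCache:
    def __init__(self, gpu_capacity=2):
        self.gpu = OrderedDict()   # small, fast tier (LRU order)
        self.host = {}             # large, slower host-memory tier
        self.gpu_capacity = gpu_capacity

    def put(self, conv_id, kv_layers):
        """Save a conversation's per-layer KV tensors (asynchronous in the
        real system; synchronous in this sketch)."""
        self.gpu[conv_id] = kv_layers
        self.gpu.move_to_end(conv_id)
        while len(self.gpu) > self.gpu_capacity:
            victim, kv = self.gpu.popitem(last=False)  # evict LRU to host
            self.host[victim] = kv

    def get(self, conv_id):
        """Fetch KV on a new turn; the real system overlaps this with
        compute by preloading layer i+1 while layer i is consumed."""
        if conv_id in self.gpu:
            self.gpu.move_to_end(conv_id)
            return self.gpu[conv_id]
        if conv_id in self.host:
            kv = self.host.pop(conv_id)   # promote back to the GPU tier
            self.put(conv_id, kv)
            return kv
        return None  # miss: prefill must recompute the KV cache

cache = HierarchicalKVCache()
cache.put("conv-1", [np.zeros((2, 8, 64)) for _ in range(4)])  # 4 layers
cache.put("conv-2", [np.zeros((2, 8, 64)) for _ in range(4)])
cache.put("conv-3", [np.zeros((2, 8, 64)) for _ in range(4)])  # evicts conv-1
assert cache.get("conv-1") is not None  # promoted back from the host tier
```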
Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs [] []
Sydney & Microsoft & Rutgers
TC-FPx, the first full-stack GPU kernel design scheme with unified Tensor Core support for 6-bit and arbitrary bit-width quantization (e.g., 5-bit).
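A hedged numpy sketch of the underlying packing arithmetic: 6-bit values are not byte-aligned, so they must be packed into a dense bit stream and unpacked at use. TC-FPx does this inside Tensor Core kernels; the functions below (names mine) only demonstrate the bit manipulation on the CPU.

```python
"""Illustrative 6-bit packing/unpacking; not Quant-LLM's kernel code."""
import numpy as np

def pack6(vals):
    """Pack an array of 6-bit integers (0..63) into a dense byte buffer."""
    bits = np.unpackbits(vals.astype(np.uint8)[:, None], axis=1)[:, 2:]  # low 6 bits
    flat = bits.reshape(-1)
    pad = (-len(flat)) % 8                      # pad to a byte boundary
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    return np.packbits(flat)

def unpack6(buf, n):
    """Recover n 6-bit integers from the packed buffer."""
    bits = np.unpackbits(buf)[: n * 6].reshape(n, 6)
    full = np.concatenate([np.zeros((n, 2), dtype=np.uint8), bits], axis=1)
    return np.packbits(full, axis=1).reshape(n)

w = np.random.randint(0, 64, size=16)
assert np.array_equal(unpack6(pack6(w), 16), w)
print(f"16 six-bit weights stored in {pack6(w).nbytes} bytes (vs. 16 bytes)")
```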
LLM alignment / RLHF training
PUZZLE: Efficiently Aligning Large Language Models through Light-Weight Context Switch []
THU
Intra-stage switching: explore model affinities and overlap computation via time-sharing.
Inter-stage switching: find the optimal switch plan with the minimum communication cost.
Based on Megatron-LM.
LLM federated fine-tuning
FwdLLM: Efficient Federated Finetuning of Large Language Models with Perturbed Inferences [] []
BUPT
Employ backpropagation (BP)-free training methods that require devices only to execute "perturbed inferences"; adaptively allocate computational loads across devices to balance convergence speed and accuracy.
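The core trick can be demonstrated with a forward-gradient step on a toy objective: the "gradient" is a directional derivative estimated from two perturbed forward passes, so no backpropagation is needed. The quadratic loss and hyperparameters below are invented for illustration, not FwdLLM's actual setup.

```python
"""BP-free training via perturbed inferences (forward gradients)."""
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    """Toy forward-only objective standing in for a fine-tuning loss."""
    return float(np.sum((w - 1.0) ** 2))

w, lr, eps = np.zeros(8), 0.02, 1e-3
for _ in range(500):
    u = rng.standard_normal(w.shape)            # random perturbation direction
    # two "perturbed inferences" yield a directional derivative: no backprop
    d = (loss(w + eps * u) - loss(w - eps * u)) / (2 * eps)
    w -= lr * d * u                             # forward-gradient update
print(round(loss(w), 6))  # decreases toward 0 as w approaches the optimum
```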
LLM training
Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism [] []
Kuaishou
Balance computation against memory utilization.
Two activation rematerialization strategies:
Pipeline-parallel-aware offloading to maximize the utilization of host memory for storing activations.
Compute-memory balanced checkpointing to balance between activation memory and computational efficiency.
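The toy planner below illustrates the compute-memory trade-off: each layer's activation is kept on device, offloaded to host memory, or marked for recomputation under explicit budgets. The costs and the greedy policy are fabricated for illustration; the paper's strategies are pipeline-parallel-aware and considerably more sophisticated.

```python
"""Toy keep/offload/recompute planner (invented costs and policy)."""

layers = [  # (name, activation_mb, recompute_ms, offload_ms)
    ("embed", 512, 1.0, 4.0),
    ("attn0", 2048, 6.0, 16.0),
    ("mlp0", 1024, 2.0, 8.0),
    ("attn1", 2048, 6.0, 16.0),
]

def plan(device_budget_mb, host_budget_mb):
    decisions, dev_used, host_used = {}, 0, 0
    # keep the most expensive-to-rematerialize activations on device first
    for name, mb, recomp, offload in sorted(layers, key=lambda l: -l[2]):
        if dev_used + mb <= device_budget_mb:
            decisions[name], dev_used = "keep", dev_used + mb
        elif host_used + mb <= host_budget_mb:
            decisions[name], host_used = "offload", host_used + mb
        else:
            decisions[name] = "recompute"
    return decisions

# one layer kept, one offloaded, one recomputed under these budgets
print(plan(device_budget_mb=3000, host_budget_mb=2048))
```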
AI Infra
SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation [] []
MSR & Microsoft
Best Paper Award
SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation (i.e., gray failure) caused by hardware redundancies and enhances overall reliability.
A comprehensive benchmark suite to evaluate individual hardware components and represent most real AI workloads.
HBM
Removing Obstacles before Breaking Through the Memory Wall: A Close Look at HBM Errors in the Field [] []
Xiamen University & Huawei & Minjiang University
Conduct the first systematic study of HBM errors, covering over 460 million error events collected from nineteen data centers and spanning over two years of deployment under a variety of services.
Calchas, a hierarchical failure prediction framework for HBM that integrates spatial, temporal, and sensor information from various device levels to predict upcoming failures.
Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?
THU & SDU & National Supercomputer Center in Wuxi
A comprehensive analysis of six years' worth of data (40 TB, comprising I/O performance data and job-running information) from Sunway TaihuLight, a system with 41,508 nodes.
Notice: The data is currently not available.
Metis: Fast Automatic Distributed Training on Heterogeneous GPUs []
Samsung Research & UNIST
Metis, a system that automatically finds efficient parallelism plans for distributed training on heterogeneous GPUs.
Balance loads with heterogeneity-awareness; prefer data parallelism over tensor parallelism within a pipeline stage.
Evaluated with three large models (GPT-3, MoE, and Wide-Resnet).
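The flavor of heterogeneity-aware planning can be shown with a brute-force search for pipeline-stage cuts that minimize the bottleneck stage time across GPUs of different speeds. Layer costs and GPU speeds below are made up, and real Metis searches a much larger space (data, tensor, and pipeline parallelism jointly).

```python
"""Toy heterogeneity-aware pipeline-stage split (invented numbers)."""
from itertools import combinations

layer_cost = [4, 4, 4, 4, 2, 2, 2, 2]   # per-layer compute cost
gpu_speed = [2.0, 1.0, 1.0]             # one fast GPU, two slower ones

def best_split():
    n, best = len(layer_cost), (float("inf"), None)
    for cuts in combinations(range(1, n), len(gpu_speed) - 1):
        bounds = [0, *cuts, n]
        stages = [layer_cost[bounds[i]:bounds[i + 1]]
                  for i in range(len(gpu_speed))]
        # the slowest stage bounds pipeline throughput
        t = max(sum(s) / v for s, v in zip(stages, gpu_speed))
        if t < best[0]:
            best = (t, bounds)
    return best

t, bounds = best_split()
print(f"bottleneck stage time {t:.2f} with layer cuts at {bounds[1:-1]}")
```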
Pecan: Cost-Efficient ML Data Preprocessing with Automatic Transformation Ordering and Hybrid Placement [] []
ETH & Google
Dynamically schedule data preprocessing workers on ML accelerator host resources to minimize the number of remote CPU workers needed to achieve peak data ingestion bandwidth.
Analyze the characteristics of input pipelines and automatically reorder transformations to increase data preprocessing worker throughput.
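A tiny cost model shows why reordering pays off: if a cheap, selective transformation commutes with an expensive one, running it first shrinks the downstream work. The pipeline and numbers below are invented; Pecan's reordering analysis is far richer.

```python
"""Why transformation reordering helps (fabricated pipeline)."""

ops = [  # (name, cost_per_item_us, selectivity: fraction of items kept)
    ("decode+augment", 50.0, 1.0),
    ("filter_bad_labels", 1.0, 0.7),
]

def pipeline_cost(order, n_items=1_000):
    total, remaining = 0.0, n_items
    for name, cost, keep in order:
        total += remaining * cost   # pay per item that reaches this op
        remaining *= keep           # selective ops shrink the stream
    return total

naive = pipeline_cost(ops)
reordered = pipeline_cost(sorted(ops, key=lambda o: o[2]))  # selective first
print(f"naive: {naive:.0f} us, reordered: {reordered:.0f} us")
```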
Harmonizing Efficiency and Practicability: Optimizing Resource Utilization in Serverless Computing with Jiagu []
SJTU IPADS & Huawei Cloud & EPFL
Jiagu, a serverless system based on OpenFaaS.
Pre-decision scheduling: decouple prediction and decision-making; predict every function's capacities on a server using a model.
Dual-staged scaling: frequent adjustment of instances.
ALPS: An Adaptive Learning, Priority OS Scheduler for Serverless Functions [] []
UVA & George Mason University & Adobe Research
ALPS: Adaptive Learning, Priority Scheduler
Application-aware kernel scheduler
Frontend: user-space; approximate shortest remaining process time (SRPT) priority scheduling by adaptively learning from an SRPT simulation on recent past workload.
Backend: use eBPF functions hooked to CFS to inform scheduling decisions (from the frontend) in the kernel.
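A minimal sketch of the frontend idea, with hypothetical names: learn each function's expected runtime from recent invocations and give shorter expected jobs higher priority, approximating SRPT. The eBPF backend that applies these priorities inside CFS is not shown.

```python
"""Approximate-SRPT priorities learned from recent history (names mine)."""
import statistics
from collections import defaultdict, deque

history = defaultdict(lambda: deque(maxlen=100))  # recent runtimes per function

def record(func, runtime_ms):
    history[func].append(runtime_ms)

def priority(func, default_ms=50.0):
    """Lower value = scheduled sooner, mimicking shortest-remaining-time."""
    runs = history[func]
    return statistics.median(runs) if runs else default_ms

for r in (3, 4, 5):
    record("thumbnail", r)
for r in (400, 420):
    record("video_transcode", r)

queue = sorted(["video_transcode", "thumbnail"], key=priority)
print(queue)  # short function first: ['thumbnail', 'video_transcode']
```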
StreamBox: A Lightweight GPU SandBox for Serverless Inference Workflow [] []
HUST & INRIA
One GPU runtime per inference workflow instead of one GPU runtime per function.
Use CUDA streams for serverless inference; fine-grained GPU memory management; and PCIe bandwidth sharing among concurrent streams.
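A hedged sketch of the one-runtime, many-streams idea using PyTorch's stream API (assumes a CUDA GPU; the fine-grained memory pooling and PCIe bandwidth sharing are not shown): two "functions" of a workflow share one process and run concurrently on separate CUDA streams.

```python
"""Two workflow functions sharing one GPU runtime via CUDA streams."""
import torch

assert torch.cuda.is_available()
s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
x = torch.randn(1024, 1024, device="cuda")

with torch.cuda.stream(s1):   # "function A" of the inference workflow
    a = x @ x
with torch.cuda.stream(s2):   # "function B" can overlap with A
    b = x + 1
torch.cuda.synchronize()      # wait for both streams to finish
print(a.shape, b.shape)
```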
A Secure, Fast, and Resource-Efficient Serverless Platform with Function REWIND [] []
Sungkyunkwan University & Yonsei University & Seoul National University
Enhance performance while maintaining strict data isolation between requests.
Reset the container to an initial state free of any sensitive data after each function request; incorporate a kernel-level memory snapshot management system; optimize runtime by reusing memory regions and leveraging the temporal locality of function executions.
Power-aware Deep Learning Model Serving with μ-Serve []
UIUC & IBM Research
Scale GPU frequency down for power saving without violating SLO attainment.
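The policy reduces to a simple search, sketched below with a fabricated frequency table and a naive latency-inversely-proportional-to-frequency model: pick the lowest GPU frequency whose predicted latency still meets the SLO with headroom.

```python
"""Toy power-aware frequency policy (fabricated model and numbers)."""

FREQS_MHZ = [900, 1100, 1300, 1500]
SLO_MS = 120.0

def predicted_latency_ms(base_ms_at_max, freq):
    return base_ms_at_max * (FREQS_MHZ[-1] / freq)  # latency ~ 1/frequency

def choose_frequency(base_ms_at_max, headroom=0.9):
    for f in FREQS_MHZ:  # lowest first: most power saving
        if predicted_latency_ms(base_ms_at_max, f) <= SLO_MS * headroom:
            return f
    return FREQS_MHZ[-1]  # no savings possible without risking the SLO

print(choose_frequency(base_ms_at_max=60.0))   # light load -> low frequency
print(choose_frequency(base_ms_at_max=105.0))  # near SLO -> max frequency
```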
Starburst: A Cost-aware Scheduler for Hybrid Cloud [] []
UC Berkeley & UCSB
Distinguished Artifact Award
Run batch workloads on private clusters or the public cloud, trading off cost against job completion time (JCT).
Dynamically control jobs' waiting times to improve utilization.
Assign longer waits for large jobs to increase their chances of running on the cluster.
Assign shorter waits to small jobs to increase their chances of running on the cloud.
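A minimal sketch of the waiting-time policy, with an invented linear wait budget: jobs wait for cluster capacity in proportion to their size before spilling to the cloud, so large (expensive-to-run-in-cloud) jobs wait longer and small jobs leave quickly.

```python
"""Size-proportional wait budgets before spilling to the cloud (toy)."""

def wait_budget_s(gpu_hours, base_s=30.0):
    return base_s * gpu_hours  # larger jobs get proportionally longer waits

def place(job_gpu_hours, waited_s, cluster_has_room):
    if cluster_has_room:
        return "cluster"
    if waited_s < wait_budget_s(job_gpu_hours):
        return "keep waiting"
    return "cloud"  # stop waiting and pay for the cloud to protect JCT

print(place(job_gpu_hours=0.5, waited_s=20, cluster_has_room=False))  # cloud
print(place(job_gpu_hours=64, waited_s=20, cluster_has_room=False))   # keep waiting
```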
MAGPY: Compiling Eager Mode DNN Programs by Monitoring Execution States [] []
THU
Generate more complete operator graphs by collecting key runtime information through monitoring program execution.
Provide a reference graph to record program execution states and leverage reference relationships to identify state changes that can impact program outputs.
OPER: Optimality-Guided Embedding Table Parallelization for Large-scale Recommendation Model []
UCSD & UCSB & Meta & Pacific Northwest National Laboratory
Provide a near-optimal parallelization strategy for embedding tables.
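For intuition, the sketch below shards embedding tables with a greedy longest-processing-time heuristic that balances per-GPU load. The table costs are invented, and OPER's plan is optimality-guided rather than this simple greedy.

```python
"""Greedy cost-balanced embedding-table sharding (illustrative only)."""
import heapq

table_cost = {"user": 8.0, "item": 6.0, "ad": 5.0, "query": 4.0, "geo": 2.0}

def shard(num_gpus):
    heap = [(0.0, g, []) for g in range(num_gpus)]  # (load, gpu, tables)
    heapq.heapify(heap)
    # place the most expensive tables first, each on the least-loaded GPU
    for name, cost in sorted(table_cost.items(), key=lambda kv: -kv[1]):
        load, gpu, tables = heapq.heappop(heap)
        heapq.heappush(heap, (load + cost, gpu, tables + [name]))
    return sorted(heap, key=lambda e: e[1])

for load, gpu, tables in shard(2):
    print(f"GPU{gpu}: {tables} (load {load})")
```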
Fast Inference for Probabilistic Graphical Models [] []
University of Western Australia & HKUST
Fast-PGM: a fast and parallel PGM inference system for importance sampling-based approximate inference algorithms.
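A tiny likelihood-weighting example (a form of importance sampling) of the kind of inference Fast-PGM accelerates: estimate P(A=1 | B=1) in a two-node Bayesian network by weighting prior samples by the evidence likelihood. The network and probabilities are mine; Fast-PGM's contribution is making such sampling fast and parallel.

```python
"""Likelihood weighting on a toy two-node Bayesian network."""
import random

random.seed(0)
P_A = 0.3                       # P(A=1)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}  # P(B=1 | A)

def estimate_p_a_given_b1(n=100_000):
    """Sample non-evidence variables from the prior; weight each sample
    by the probability of the evidence (B=1)."""
    num = den = 0.0
    for _ in range(n):
        a = 1 if random.random() < P_A else 0
        w = P_B_GIVEN_A[a]
        num += w * a
        den += w
    return num / den

# Exact answer: 0.3*0.9 / (0.3*0.9 + 0.7*0.2) = 0.27/0.41 ~ 0.659
print(round(estimate_p_a_given_b1(), 3))
```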
PeRF: Preemption-enabled RDMA Framework []
Acryl Inc. & Sungkyunkwan University
Offer software-based performance isolation for efficient multi-tenancy in RDMA.
HydraRPC: RPC in the CXL Era []
Alibaba & THU & ZJU & PKU
Utilize CXL-attached HDM (host-managed device memory) to build RPC systems.
FastCommit: Resource-Efficient, Performant and Cost-Effective File System Journaling
Best Paper Award
An Empirical Study of Rust-for-Linux: The Success, Dissatisfaction, and Compromise [] []
BUPT & UESTC