SC 2024
Meta Info
Homepage: https://sc24.conference-program.com
Paper list: https://dl.acm.org/doi/proceedings/10.5555/3703596
Papers
AI Infrastructure
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning [Paper] [HAI Platform Code]
DeepSeek AI
Includes network co-design, HFReduce (a collective communication library), HaiScale (optimized parallelism methods), the 3FS distributed file system, and the HAI Platform (task scheduling and fault tolerance).
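As a rough illustration of what a hierarchical collective library like HFReduce optimizes, here is a minimal NumPy sketch of a two-stage (intra-node, then inter-node) gradient reduction; the function names are illustrative and this is not the released HAI Platform code:

```python
# Sketch of hierarchical gradient reduction: per-node CPU-side
# reduction first, then a cross-node reduction (illustrative only).
import numpy as np

def intra_node_reduce(gpu_grads: list) -> np.ndarray:
    """Stage 1: sum the gradients from all GPUs of one node."""
    return np.sum(gpu_grads, axis=0)

def inter_node_allreduce(node_sums: list) -> np.ndarray:
    """Stage 2: reduce the per-node partial sums across nodes
    (stands in for an RDMA/MPI allreduce in a real system)."""
    return np.sum(node_sums, axis=0)

# Toy cluster: 2 nodes x 4 GPUs, each GPU holding one gradient.
rng = np.random.default_rng(0)
nodes = [[rng.standard_normal(8) for _ in range(4)] for _ in range(2)]

node_sums = [intra_node_reduce(g) for g in nodes]
global_sum = inter_node_allreduce(node_sums)
# Each GPU would then receive a copy of global_sum (broadcast stage).
assert np.allclose(global_sum, np.sum([g for n in nodes for g in n], axis=0))
```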
Large Language Models (LLMs)
LLM inference
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [Paper] [Code]
Iowa State University & TU Darmstadt
Continuous Asynchronous Speculation: run single-token inference simultaneously with several speculative runs.
Early Inference Cancellation: skip the computation of invalidated runs.
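A simplified, single-process sketch of the accept/cancel logic: `draft_model` and `target_model` are hypothetical stand-ins, and verification here is sequential rather than PipeInfer's asynchronous, pipelined execution:

```python
# Simplified speculative decoding with early cancellation.
import random
random.seed(0)
VOCAB = list(range(100))

def draft_model(ctx, k=4):
    """Cheap model: propose k speculative tokens."""
    return [random.choice(VOCAB) for _ in range(k)]

def target_model(ctx):
    """Expensive model: deterministic 'correct' next token."""
    return (sum(ctx) * 31 + len(ctx)) % len(VOCAB)

def speculative_step(ctx):
    proposal = draft_model(ctx)               # speculative run
    accepted = []
    for tok in proposal:
        truth = target_model(ctx + accepted)  # verification
        if tok != truth:
            # Early cancellation: the rest of this speculative run
            # is invalidated, so its computation is skipped.
            accepted.append(truth)
            break
        accepted.append(tok)
    return accepted

ctx = [1, 2, 3]
for _ in range(5):
    ctx += speculative_step(ctx)
print(ctx)
```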
LLM for anomaly detection
Large Language Models for Anomaly Detection in Computational Workflows: From Supervised Fine-Tuning to In-Context Learning [Paper] [Code] [Benchmark]
Argonne National Laboratory & USC & Oak Ridge National Laboratory
Investigated two approaches: (1) supervised fine-tuning (pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies); (2) in-context learning (prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning).
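A hedged sketch of the second approach (in-context learning): a few labeled log entries are placed in the prompt so the LLM can classify a new entry without fine-tuning. The prompt format and the `query_llm` call are illustrative, not the paper's benchmark code:

```python
# Few-shot anomaly-detection prompt built from labeled examples.
FEW_SHOT = [
    ("job_42 finished in 128s, exit code 0", "normal"),
    ("job_17 finished in 127s, exit code 0", "normal"),
    ("job_88 stalled for 9000s, exit code 137", "anomaly"),
]

def build_prompt(log_line: str) -> str:
    header = ("Classify each computational-workflow log entry as "
              "'normal' or 'anomaly'.\n\n")
    shots = "".join(f"Entry: {x}\nLabel: {y}\n\n" for x, y in FEW_SHOT)
    return header + shots + f"Entry: {log_line}\nLabel:"

print(build_prompt("job_91 finished in 131s, exit code 0"))
# The string would then be sent to an LLM endpoint, e.g.:
#   label = query_llm(build_prompt(line))   # hypothetical call
```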
Mixture-of-Experts (MoEs)
Deep Learning Recommendation Models (DLRMs)
Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [Paper] [Code]
Indiana University, Bloomington & Meta & University of Rochester & ICT, CAS
Provides an in-depth analysis of embedding data features and employs error-bounded lossy compression to reduce the communication data size.
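The core primitive, error-bounded lossy compression, can be sketched with uniform quantization; the paper's dual-level adaptive selection of the bound is omitted here, and `eb` is a fixed absolute error bound:

```python
# Error-bounded lossy compression: quantizing with step 2*eb
# guarantees the reconstruction error never exceeds eb.
import numpy as np

def compress(x: np.ndarray, eb: float) -> np.ndarray:
    return np.round(x / (2 * eb)).astype(np.int32)

def decompress(q: np.ndarray, eb: float) -> np.ndarray:
    return q.astype(np.float32) * (2 * eb)

emb = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
eb = 1e-2
q = compress(emb, eb)        # int32 codes; entropy-code them in practice
rec = decompress(q, eb)
assert np.max(np.abs(emb - rec)) <= eb + 1e-7  # error bound holds
```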
Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link [Paper] [Code]
UC Merced & SK Hynix
TECO: Tensor-CXL-Offload
Introduces a CXL-based cache-coherent interconnect that builds a coherence domain spanning CPU memory and accelerator memory, so tensors can be offloaded to CPU memory to save accelerator memory.
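A rough analogue of the offload/reload pattern, written with PyTorch's saved-tensor hooks (a real API): activations kept for backward are parked in host memory and copied back on demand. TECO itself relies on CXL cache coherence rather than explicit copies; this only illustrates the memory-saving pattern:

```python
# Park saved activations on the CPU and reload them for backward.
import torch

dev = "cuda" if torch.cuda.is_available() else "cpu"

def pack_to_cpu(t: torch.Tensor):
    return t.to("cpu")   # offload: accelerator memory can be reused

def unpack_to_dev(t: torch.Tensor):
    return t.to(dev)     # reload just in time for backward

x = torch.randn(8, 512, device=dev, requires_grad=True)
w = torch.randn(512, 512, device=dev, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_to_dev):
    loss = torch.relu(x @ w).sum()   # saved activations live on the CPU
loss.backward()                      # they are pulled back here
print(x.grad.shape)
```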
Graph Transformer
Reinforcement Learning (RL)
Job Scheduling
Distributed Training
Optimizing Distributed ML Communication with Fused Computation-Collective Operations [Paper]
AMD
Developed three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address communication overheads in DLRM, Transformer, and MoE model architectures.
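A single-process NumPy schematic of why fusion helps: each finished chunk of the GEMV output is handed to the collective immediately, overlapping compute with communication. `send_chunk` is a placeholder for the per-chunk AllReduce, not AMD's implementation:

```python
# Fused GEMV + "AllReduce": interleave chunked compute with
# (simulated) communication instead of communicating at the end.
import numpy as np

def send_chunk(idx, chunk, out):
    out[idx] = chunk             # stands in for an in-flight AllReduce

def fused_gemv_allreduce(A, x, n_chunks=4):
    rows = np.array_split(np.arange(A.shape[0]), n_chunks)
    out = [None] * n_chunks
    for i, r in enumerate(rows):
        partial = A[r] @ x            # compute one output chunk...
        send_chunk(i, partial, out)   # ...and overlap its reduction
    return np.concatenate(out)

A = np.random.default_rng(0).standard_normal((64, 32))
x = np.random.default_rng(1).standard_normal(32)
assert np.allclose(fused_gemv_allreduce(A, x), A @ x)
```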
Serverless Computing
GPU Sharing
Performance Analysis
GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems [Paper] [Code]
Beihang University
Employs static analysis to identify the performance-critical parameters of kernel functions; segments the program execution at external library calls and asynchronous kernel operations; constructs a state transfer graph and estimates the workload of each program segment.
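An illustrative sketch of the segmentation idea: the trace events, segment boundaries, and toy workload model below are all hypothetical, not GVARP's actual analysis:

```python
# Cut a trace at library calls, treat each segment as a node in a
# state-transfer graph, and estimate per-segment workload from the
# performance-critical kernel parameters.
from collections import defaultdict

trace = [
    ("kernel", "gemm", {"m": 1024, "n": 1024}),
    ("lib_call", "MPI_Allreduce", {}),
    ("kernel", "gemm", {"m": 2048, "n": 1024}),
    ("kernel", "scan", {"n": 4096}),
    ("lib_call", "MPI_Barrier", {}),
]

def estimate(params):
    # Toy workload model: product of the critical parameters.
    w = 1
    for v in params.values():
        w *= v
    return w

segments, edges, cur = defaultdict(int), [], 0
for kind, name, params in trace:
    if kind == "lib_call":            # segment boundary -> graph edge
        edges.append((cur, cur + 1, name))
        cur += 1
    else:
        segments[cur] += estimate(params)

print(dict(segments))   # estimated workload per segment
print(edges)            # state transfers between segments
```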
Interconnects
Acronyms
LLM: Large Language Model
MoE: Mixture-of-Experts
DLRM: Deep Learning Recommendation Model
PEFT: Parameter-Efficient Fine-Tuning
MIG: Multi-Instance GPU
MPS: Multi-Process Service
CXL: Compute Express Link