SC 2024
Last updated
Homepage:
Paper list:
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning [] []
DeepSeek AI
Includes network co-design, HFReduce (a collective communication library), HaiScale (optimized parallelism methods), the 3FS distributed file system, and the HAI Platform (task scheduling and fault tolerance).
LLM inference
PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [] []
Iowa State University & TU Darmstadt
Continuous Asynchronous Speculation: run single-token inference simultaneously with several speculative runs.
Early Inference Cancellation: skip the computation of invalidated runs.
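A toy sketch of the two ideas above (the draft/target functions are stand-ins, not PipeInfer's pipeline): speculative verification runs launch asynchronously alongside the guaranteed single-token run, and runs invalidated by a mismatch are cancelled early.

```python
# Toy continuous asynchronous speculation with early cancellation.
from concurrent.futures import ThreadPoolExecutor

def target_step(prefix):               # stand-in for one target-model step
    return sum(prefix) % 13

def draft_tokens(prefix, k=4):         # stand-in for k cheap draft tokens
    return [(sum(prefix) + i) % 13 for i in range(k)]

def decode(prompt, steps=4):
    tokens = list(prompt)
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(steps):
            spec = draft_tokens(tokens)
            # Run 0 is the guaranteed single-token inference; runs 1..k-1
            # verify the speculative continuations, all in flight at once.
            futures = [pool.submit(target_step, tokens + spec[:i])
                       for i in range(len(spec))]
            for i, fut in enumerate(futures):
                verified = fut.result()
                tokens.append(verified)            # run i emits a real token
                if verified != spec[i]:
                    # Early inference cancellation: later runs assumed a token
                    # that was wrong, so cancel them and discard their output.
                    for later in futures[i + 1:]:
                        later.cancel()
                    break
    return tokens

print(decode([1, 2, 3]))
```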
LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services [] [] []
IBM Research
Learn a predictive model to recommend the most cost-effective hardware for a previously unseen LLM.
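A minimal sketch of the approach (hypothetical features, prices, and measurements; not LLM-Pilot's actual model): fit a regressor on benchmarked configurations, then rank candidate GPUs for an unseen LLM by predicted throughput per dollar.

```python
# Fit a throughput predictor on benchmark data, then rank GPUs for an unseen
# model size. All numbers are illustrative.
from sklearn.ensemble import GradientBoostingRegressor

# features: [model_params_B, gpu_mem_GB, gpu_TFLOPS, batch_size]
X = [[7, 24, 150, 8], [7, 80, 312, 8], [13, 80, 312, 8],
     [13, 40, 150, 4], [70, 80, 312, 2]]
y = [55.0, 120.0, 80.0, 30.0, 12.0]                  # measured tokens/s

predictor = GradientBoostingRegressor().fit(X, y)

gpus = {"gpu_a": (24, 150, 1.0), "gpu_b": (80, 312, 3.5)}   # mem, TFLOPS, $/h
for name, (mem, tflops, price) in gpus.items():
    tput = predictor.predict([[34, mem, tflops, 8]])[0]     # unseen 34B LLM
    print(f"{name}: {tput:.0f} tok/s, {tput * 3600 / price:.0f} tok/$")
```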
LLM fine-tuning
Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity [] []
MSRA & THU
LLM for anomaly detection
Large Language Models for Anomaly Detection in Computational Workflows: From Supervised Fine-Tuning to In-Context Learning [] [] []
Argonne National Laboratory & USC & Oak Ridge National Laboratory
Investigate two approaches: (1) supervised fine-tuning (pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies); (2) in-context learning (prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning).
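A minimal sketch of the in-context-learning setup (the prompt wording and log format are hypothetical, not the paper's): a task description plus a few labeled workflow records, followed by the unlabeled record to classify.

```python
# Build a few-shot anomaly-detection prompt from labeled workflow records.
examples = [
    ("job=genome_align runtime=512s rss=3.9GB exit=0", "normal"),
    ("job=genome_align runtime=9804s rss=3.8GB exit=0", "anomalous"),
]
query = "job=genome_align runtime=498s rss=61.2GB exit=0"

prompt = "Classify each computational-workflow record as normal or anomalous.\n\n"
for record, label in examples:
    prompt += f"Record: {record}\nLabel: {label}\n\n"
prompt += f"Record: {query}\nLabel:"   # the LLM completes with a label
print(prompt)
```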
MoE training
APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes [] []
SYSU
DLRM training
Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching [] []
WHU & NVIDIA & UMacau
EcoRec: eliminate redundancy in TT (Tensor-Train) operations; micro-batching with sorted indices to reduce memory.
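A toy illustration of why sorted-index micro-batching helps (not EcoRec's TT pipeline): sorting makes duplicate lookups contiguous, so each micro-batch touches fewer unique embedding rows and peak memory drops.

```python
# Sorting lookup indices makes duplicates adjacent, so micro-batches formed
# from the sorted order each fetch only a few unique rows.
import numpy as np

indices = np.array([7, 1, 7, 3, 1, 7, 9, 3])   # raw embedding lookups
order = np.argsort(indices)                    # remember original positions
sorted_idx = indices[order]

for mb in np.array_split(np.arange(len(sorted_idx)), 2):    # 2 micro-batches
    unique_rows, inverse = np.unique(sorted_idx[mb], return_inverse=True)
    # Fetch only unique_rows; `inverse` (plus `order`) scatters results back.
    print("unique rows:", unique_rows, "for", len(mb), "lookups")
```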
Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression [] []
Indiana University, Bloomington & Meta & University of Rochester & ICT, CAS
In-depth analysis of embedding data features; employ error-bounded lossy compression to reduce the communication data size.
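A minimal sketch of one simple error-bounded scheme, uniform scalar quantization (the paper's dual-level adaptive design is more involved): every reconstructed value stays within the stated bound of the original.

```python
# Uniform quantization with a hard error bound: round to multiples of
# 2*error_bound, so |reconstructed - original| <= error_bound.
import numpy as np

def compress(values, error_bound):
    return np.round(values / (2 * error_bound)).astype(np.int32)

def decompress(codes, error_bound):
    return codes * (2 * error_bound)

grads = np.random.randn(8).astype(np.float32)     # e.g., embedding gradients
codes = compress(grads, error_bound=0.01)         # small ints entropy-code well
recon = decompress(codes, error_bound=0.01)
assert np.all(np.abs(recon - grads) <= 0.01 + 1e-9)   # bound (+ float slack)
```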
Tensor offloading
Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link [] []
UC Merced & SK Hynix
TECO: Tensor-CXL-Offload
Introduce a CXL-based cache-coherent interconnect that builds a coherence domain spanning CPU and accelerator memory; offload tensors to CPU memory to save accelerator memory.
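For contrast, a minimal sketch of conventional tensor offloading in plain PyTorch (explicit host-device copies over PCIe, exactly the copying TECO's CXL coherence domain avoids): park a tensor in pinned CPU memory and fetch it back before reuse.

```python
# Explicit offload/fetch between GPU memory and pinned host memory.
import torch

def offload(t):                     # device -> pinned host, async copy
    host = torch.empty(t.shape, dtype=t.dtype, pin_memory=True)
    host.copy_(t, non_blocking=True)
    return host

def fetch(host, device):            # host -> device, overlaps with compute
    return host.to(device, non_blocking=True)

if torch.cuda.is_available():
    act = torch.randn(1024, 1024, device="cuda")
    parked = offload(act)
    torch.cuda.synchronize()        # ensure the copy finished before freeing
    del act                         # GPU copy can now be freed
    act = fetch(parked, "cuda")     # bring back before the tensor is needed
```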
DLRM inference
RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules [] []
RUC & Microsoft & UCSD
Create fused kernels with distinct schedules for different feature fields.
Graph transformer training
TorchGT: A Holistic System for Large-Scale Graph Transformer Training [] []
NTU & Shanghai AI Lab & ZJU & SenseTime
Reinforcement learning
Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing [] []
Stevens Institute of Technology & NEU & Stony Brook University & Missouri University of Science and Technology
Introduce a generic asynchronous learning paradigm.
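A toy sketch of staleness-aware asynchronous learning (a hypothetical policy, not Stellaris itself): trajectories carry the policy version that produced them, and the learner drops or down-weights stale ones.

```python
# Actors tag trajectories with the policy version; the learner discards
# updates beyond MAX_STALENESS and softly discounts the rest.
import random

MAX_STALENESS = 2
policy_version = 0
queue = []                                   # (version, trajectory) pairs

def actor_step():                            # stand-in for a serverless actor
    queue.append((policy_version, [random.random() for _ in range(4)]))

for _ in range(10):
    for _ in range(3):                       # actors outpace the learner
        actor_step()
    version, traj = queue.pop(0)
    staleness = policy_version - version
    if staleness > MAX_STALENESS:
        continue                             # too stale: discard entirely
    weight = 1.0 / (1 + staleness)           # soften stale gradients
    # apply_gradients(traj, scale=weight) would go here (hypothetical helper)
    policy_version += 1
```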
GPU cluster scheduling
PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters [] []
UW-Madison
Characterize which applications are more likely to suffer from performance variability; balance performance variability with locality to ensure jobs are spread across as few nodes as possible.
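A toy sketch of one plausible reading of such a policy (not PAL's actual algorithm): prefer low-variability nodes, and tie-break toward packing the job onto as few nodes as possible.

```python
# Variability-aware, locality-preserving placement.
nodes = {                              # node -> (variability class, free GPUs)
    "n0": ("low", 4), "n1": ("low", 2), "n2": ("high", 4), "n3": ("high", 4),
}

def place(job_gpus):
    # Low-variability nodes first; within a class, biggest nodes first so the
    # job spans the fewest machines.
    ranked = sorted(nodes.items(), key=lambda kv: (kv[1][0] != "low", -kv[1][1]))
    picked = []
    for name, (_, free) in ranked:
        if job_gpus <= 0:
            break
        take = min(free, job_gpus)
        picked.append((name, take))
        job_gpus -= take
    return picked if job_gpus <= 0 else None   # None: not enough capacity

print(place(6))   # -> [('n0', 4), ('n1', 2)]
```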
Communication optimization
Optimizing Distributed ML Communication with Fused Computation-Collective Operations []
AMD
Develop three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the communication overheads in DLRM, Transformer, and MoE model architectures.
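A minimal sketch of the coarse-grained version of this idea using asynchronous collectives (chunked overlap, not AMD's fused kernels): the reduction of chunk i proceeds while chunk i+1 is still being computed. Assumes torch.distributed is already initialized with a GPU-capable backend.

```python
# Chunked GEMV with each finished chunk's all-reduce in flight while the
# next chunk computes. Requires an initialized process group (e.g., NCCL).
import torch
import torch.distributed as dist

def gemv_allreduce(w_chunks, x):
    handles, outs = [], []
    for w in w_chunks:
        y = w @ x                                           # compute chunk
        handles.append(dist.all_reduce(y, async_op=True))   # reduce in flight
        outs.append(y)
    for h in handles:                                       # drain comm
        h.wait()
    return torch.cat(outs)
```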
Serverless inference
SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing [] []
SIAT, CAS & UMacau
Integrate adaptive pre-warming windows; built on top of OpenFaaS.
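A toy sketch of an adaptive pre-warming window (a hypothetical estimator, not SMIless's): derive the keep-warm time from recent inter-arrival gaps, so bursts avoid cold starts without paying for idle containers during lulls.

```python
# Size the pre-warm window from recent request inter-arrival statistics.
import statistics

def prewarm_window(arrival_times, margin=1.5):
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    if len(gaps) < 2:
        return 60.0                      # fallback window (seconds)
    # Cover the typical gap plus variability; cap to bound idle cost.
    return min(statistics.mean(gaps) + margin * statistics.stdev(gaps), 600.0)

print(prewarm_window([0.0, 4.1, 9.0, 12.8, 17.5]))
```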
GPU sharing
ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments [] []
Chung-Ang University & Electronics and Telecommunications Research Institute & Virginia Tech
Integrate MIG and MPS to enhance GPU utilization.
Performance analysis
GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems [] []
Beihang University
Employ static analysis to identify the performance-critical parameters of kernel functions; segment the program execution with external library calls and asynchronous kernel operations; construct a state transfer graph and estimate the workload of each program segment.
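A toy sketch of the segmentation-and-estimation idea (the trace format and cost model here are hypothetical, not GVARP's analysis): split an execution trace at library-call boundaries and score each segment by its kernel launch geometry.

```python
# Segment a fake execution trace at library calls and estimate each
# segment's workload from kernel launch parameters.
trace = [
    ("kernel",   {"grid": 1024, "block": 256}),
    ("lib_call", {"name": "cublasSgemm"}),
    ("kernel",   {"grid": 4096, "block": 128}),
    ("kernel",   {"grid": 512,  "block": 256}),
]

segments, current = [], []
for event, meta in trace:
    current.append((event, meta))
    if event == "lib_call":                  # segment boundary
        segments.append(current)
        current = []
if current:
    segments.append(current)

for i, seg in enumerate(segments):
    work = sum(m["grid"] * m["block"] for e, m in seg if e == "kernel")
    print(f"segment {i}: estimated workload = {work} threads")
```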
Interconnect characterization
Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects [] []
Sapienza University of Rome & University of Trento & Vrije Universiteit Amsterdam & ETH & CINECA & University of Antwerp & HPE & NVIDIA
Characterize three supercomputers: Alps, Leonardo, and LUMI.
Acronyms
LLM: Large Language Model
MoE: Mixture-of-Experts
DLRM: Deep Learning Recommendation Model
PEFT: Parameter-Efficient Fine-Tuning
MIG: Multi-Instance GPU
MPS: Multi-Process Service
CXL: Compute Express Link