# SC 2024

## Meta Info

Homepage: <https://sc24.conference-program.com>

Paper list: <https://dl.acm.org/doi/proceedings/10.5555/3703596>

## Papers

### AI Infrastructure

* Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00089)] \[[HAI Platform Code](https://github.com/HFAiLab/hai-platform)]
  * DeepSeek AI
  * Covers network co-design, HFReduce (collective communication library), HaiScale (optimized parallelism methods), the 3FS distributed file system, and the HAI Platform (task scheduling, fault tolerance). HFReduce's CPU-assisted reduction pattern is sketched below.
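
A minimal numpy sketch of the CPU-assisted allreduce pattern that HFReduce is built around (intra-node reduce into host memory, inter-node reduce, broadcast back). The data layout and function are illustrative, not DeepSeek's API.

```python
import numpy as np

def cpu_assisted_allreduce(grads_per_node):
    """Allreduce split into three stages, mimicking a CPU-assisted scheme:
    (1) reduce each node's GPU gradients into host memory, (2) allreduce the
    per-node partial sums across nodes, (3) broadcast back to every GPU."""
    node_partials = [np.sum(gpu_grads, axis=0)          # stage 1: intra-node
                     for gpu_grads in grads_per_node]
    global_sum = np.sum(node_partials, axis=0)          # stage 2: inter-node
    return [[global_sum.copy() for _ in gpu_grads]      # stage 3: broadcast
            for gpu_grads in grads_per_node]

# 2 nodes x 4 GPUs, each GPU holding one gradient tensor.
grads = [[np.random.rand(8) for _ in range(4)] for _ in range(2)]
out = cpu_assisted_allreduce(grads)
assert np.allclose(out[0][0], sum(g for node in grads for g in node))
```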

### Large Language Models (LLMs)

* LLM inference
  * PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00046)] \[[Code](https://github.com/AutonomicPerfectionist/PipeInfer)]
    * Iowa State University & TU Darmstadt
    * *Continuous Asynchronous Speculation*: run single-token inference simultaneously with several speculative runs.
    * *Early Inference Cancellation*: skip the computation of invalidated runs. Both mechanisms are sketched after this list.
  * LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00022)] \[[Benchmark](https://github.com/fmperf-project/fmperf)] \[[Code](https://github.com/IBM/LLM-performance-prediction)]
    * IBM Research
    * Learn a predictive model to recommend the most cost-effective hardware for a previously unseen LLM (a toy predictor follows this list).
* LLM fine-tuning
  * Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00081)] \[[Code](https://github.com/HPHEX/LongExposure)]
    * MSRA & THU
* LLM for anomaly detection
  * Large Language Models for Anomaly Detection in Computational Workflows: From Supervised Fine-Tuning to In-Context Learning \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00098)] \[[Code](https://github.com/PoSeiDon-Workflows/LLM_AD)] \[[Benchmark](https://github.com/PoSeiDon-Workflows/FlowBench)]
    * Argonne National Laboratory & USC & Oak Ridge National Laboratory
    * Investigate two approaches: (1) supervised fine-tuning (pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies); (2) in-context learning (prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning). An example ICL prompt is sketched after this list.
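
For PipeInfer, a toy single-threaded simulation of its two mechanisms: several speculative runs stay in flight while verification proceeds, and a mismatch cancels every run built on the invalidated speculation. The draft/target stand-in functions and all parameters are invented for illustration.

```python
from collections import deque

def draft_next(tok):    # cheap draft model (stand-in)
    return (2 * tok + 1) % 101

def target_next(tok):   # expensive target model; disagrees when tok % 7 == 0
    return (tok + 3) % 101 if tok % 7 == 0 else (2 * tok + 1) % 101

def generate(seed, n_tokens, max_inflight=3, depth=2):
    out, inflight, tip = [seed], deque(), seed
    while len(out) <= n_tokens:
        # Continuous speculation: keep several runs in flight, each extending
        # the previous run's (still unconfirmed) tip.
        while len(inflight) < max_inflight:
            run = []
            for _ in range(depth):
                tip = draft_next(tip)
                run.append(tip)
            inflight.append(run)
        # Verify the oldest run (in PipeInfer this overlaps with speculation).
        for tok in inflight.popleft():
            true_tok = target_next(out[-1])
            out.append(true_tok)
            if true_tok != tok:
                # Early inference cancellation: later runs extended a wrong
                # speculation, so drop them and restart from the confirmed tip.
                inflight.clear()
                tip = out[-1]
                break
    return out[: n_tokens + 1]

print(generate(seed=5, n_tokens=20))
```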
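
In the spirit of LLM-Pilot, a toy performance predictor: fit a regressor on benchmarked (model, hardware) pairs, then pick the cheapest hardware option whose predicted throughput meets the SLO. The features, throughput formula, and price table are all synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Benchmark data: features = (model size in B params, GPU mem GB, GPU count),
# target = throughput in tok/s. The formula below is a made-up ground truth;
# LLM-Pilot trains on real benchmark measurements across many workloads.
X = rng.uniform([1, 16, 1], [70, 80, 8], size=(200, 3))
y = 100 * X[:, 1] * X[:, 2] / X[:, 0] * rng.normal(1.0, 0.05, 200)

predictor = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Candidate hardware for an unseen 40B model: (name, mem GB, count, $/h).
options = [("L4 x4", 24, 4, 2.4), ("A100 x2", 80, 2, 6.0), ("A100 x8", 80, 8, 24.0)]
slo, best = 300.0, None
for name, mem, cnt, price in options:
    tput = predictor.predict([[40.0, mem, cnt]])[0]
    if tput >= slo and (best is None or price < best[1]):
        best = (name, price, round(tput))
print("cheapest option predicted to meet the SLO:", best)
```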
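
For the in-context learning approach, a sketch of how labeled workflow records can be embedded as few-shot demonstrations in a prompt. The record format and wording are invented and do not follow FlowBench's schema.

```python
# A minimal few-shot (in-context learning) prompt for anomaly detection.
EXAMPLES = [
    ("job=genome_align cpu=0.92 runtime=310s io_wait=0.02", "normal"),
    ("job=genome_align cpu=0.11 runtime=2840s io_wait=0.71", "anomalous"),
    ("job=merge_bam cpu=0.88 runtime=95s io_wait=0.04", "normal"),
]

def build_prompt(query_record):
    lines = ["You are given execution records from a computational workflow.",
             "Label each record as 'normal' or 'anomalous'.", ""]
    for record, label in EXAMPLES:                   # few-shot demonstrations
        lines.append(f"Record: {record}\nLabel: {label}\n")
    lines.append(f"Record: {query_record}\nLabel:")  # query to classify
    return "\n".join(lines)

print(build_prompt("job=merge_bam cpu=0.09 runtime=1203s io_wait=0.66"))
# The completed label would come from an LLM call, omitted here.
```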

### Mixture-of-Experts (MoEs)

* APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00096)] \[[Code](https://github.com/Atopos-309/APTMoE)]
  * SYSU

### Deep Learning Recommendation Models (DLRMs)

* Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00055)] \[[Code](https://doi.org/10.5281/zenodo.13324403)]
  * WHU & NVIDIA & UMacau
  * **EcoRec**: eliminate redundancy in TT (Tensor-Train) operations; micro-batch with sorted indices to reduce memory (a TT-format lookup is sketched after this list).
* Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00095)] \[[Code](https://zenodo.org/records/13119689)]
  * Indiana University, Bloomington & Meta & University of Rochester & ICT, CAS
  * In-depth analysis of embedding data features; employ error-bounded lossy compression to reduce the communicated data volume (an error-bounded quantizer is sketched after this list).
* Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00100)] \[[Code](https://github.com/luckyq/ADSC-24)]
  * UC Merced & SK Hynix
  * **TECO**: Tensor-CXL-Offload
  * Introduce a CXL-based cache-coherent interconnect that places CPU memory and accelerator memory in one coherence domain; offload tensors to CPU memory to relieve accelerator memory pressure (the offloading pattern is sketched after this list).
* RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules \[[Paper](https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00047)] \[[Code](https://github.com/PanZaifeng/RecFlex)]
  * RUC & Microsoft & UCSD
  * Create fused kernels with distinct schedules for *different* feature fields (a per-field dispatch is sketched after this list).
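
A numpy sketch of the TT-format embedding lookup that EcoRec optimizes: the table is factorized into three cores, and a row is reconstructed by contracting one slice per core. Shapes are illustrative; the closing comment notes where sorted micro-batches expose the reusable partial contractions.

```python
import numpy as np

# TT-factorized embedding table: 512 rows (8*8*8), dim 64 (4*4*4).
I, D, R = (8, 8, 8), (4, 4, 4), (1, 16, 16, 1)       # shapes are illustrative
rng = np.random.default_rng(0)
cores = [rng.normal(size=(R[k], I[k], D[k], R[k + 1])) for k in range(3)]

def tt_lookup(row):
    """Reconstruct one embedding row by contracting one slice per TT core."""
    i1, rem = divmod(row, I[1] * I[2])
    i2, i3 = divmod(rem, I[2])
    a = cores[0][0, i1]                  # (D1, R1)
    b = cores[1][:, i2]                  # (R1, D2, R2)
    c = cores[2][:, i3, :, 0]            # (R2, D3)
    return np.einsum("ar,rbs,sc->abc", a, b, c).reshape(-1)

vec = tt_lookup(137)
assert vec.shape == (64,)
# Lookups sharing (i1, i2) share the partial contraction of a and b; sorted
# micro-batches group such rows together -- the flavor of redundancy EcoRec
# eliminates, while capping per-batch memory.
```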
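
The primitive underneath error-bounded lossy compression is a quantizer whose reconstruction error never exceeds the bound; a minimal sketch follows. The paper's dual-level scheme additionally adapts the bound, and real codecs entropy-code the quantized values.

```python
import numpy as np

def compress(x, eb):
    """Uniform quantization with absolute error bound eb: the reconstruction
    is guaranteed to differ from x by at most eb per element."""
    return np.round(x / (2 * eb)).astype(np.int32)

def decompress(q, eb):
    return q.astype(np.float32) * (2 * eb)

grads = np.random.randn(1024, 64).astype(np.float32)   # embedding gradients
eb = 1e-2                  # the bound would be adapted per table / iteration
codes = compress(grads, eb)       # int32 codes; entropy-coded in a real codec
rec = decompress(codes, eb)
assert np.abs(rec - grads).max() <= eb + 1e-7
```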
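
A crude PyTorch stand-in for TECO's offloading pattern: park tensors in host memory and fetch them back on demand. Over PCIe this is an explicit copy; CXL's contribution is making the host side cache-coherent. The `OffloadStash` class is hypothetical, not TECO's API.

```python
import torch

class OffloadStash:
    """Toy tensor offloader (hypothetical API): park tensors in host memory
    and fetch them back on demand. A CXL device would make the host side
    cache-coherent and byte-addressable; over plain PCIe this is a copy."""
    def __init__(self, device):
        self.device = device
        self.slots = {}

    def offload(self, key, t):
        host = torch.empty(t.shape, dtype=t.dtype, device="cpu",
                           pin_memory=torch.cuda.is_available())
        host.copy_(t, non_blocking=True)    # D2H copy into pinned host memory
        self.slots[key] = host

    def fetch(self, key):
        return self.slots.pop(key).to(self.device, non_blocking=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
stash = OffloadStash(device)
act = torch.randn(4096, 1024, device=device)
stash.offload("layer3.act", act)
del act                                 # accelerator copy can now be reclaimed
restored = stash.fetch("layer3.act")    # in practice, prefetched before backward
```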
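
A toy rendering of RecFlex's per-field idea: each feature field is dispatched to an execution path chosen from its own characteristics. The two paths and the length threshold merely stand in for RecFlex's generated kernel schedules.

```python
import numpy as np

def embed_field(table, indices, schedule):
    if schedule == "chunked":                      # long multi-hot pooling
        chunks = [table[c].sum(axis=0) for c in np.array_split(indices, 4)]
        return np.sum(chunks, axis=0)
    return table[indices].sum(axis=0)              # short lists: direct path

def pick_schedule(pool_len):
    return "chunked" if pool_len > 64 else "direct"

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 16))
fields = {"clicks": rng.integers(0, 1000, 300),    # hot, long pooling lists
          "region": rng.integers(0, 1000, 3)}      # short field
out = {f: embed_field(table, idx, pick_schedule(len(idx)))
       for f, idx in fields.items()}
```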

### Graph Transformer

* TorchGT: A Holistic System for Large-Scale Graph Transformer Training \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00083)] \[[Code](https://github.com/zxmeng98/torchgt)]
  * NTU & Shanghai AI Lab & ZJU & SenseTime

### Reinforcement Learning (RL)

* Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00045)] \[[Code](https://github.com/IntelliSys-Lab/Stellaris-SC24)]
  * Stevens Institute of Technology & NEU & Stony Brook University & Missouri University of Science and Technology
  * Introduce a generic asynchronous learning paradigm with staleness-aware gradient handling (a toy weighting rule is sketched below).
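
A toy staleness-aware update rule: gradients computed against an old parameter version are down-weighted, and overly stale ones dropped. The 1/(1+s) weighting is illustrative, not Stellaris's actual policy.

```python
import numpy as np

def staleness_weight(s, bound=8):
    """Down-weight stale gradients; drop anything beyond the bound."""
    return 0.0 if s > bound else 1.0 / (1.0 + s)

params, version, lr = np.zeros(4), 0, 0.1
# Async updates from serverless actors: (parameter version used, gradient).
updates = [(0, np.ones(4)), (0, np.ones(4)), (1, np.ones(4)), (1, np.full(4, 2.0))]
for used, grad in updates:
    params -= lr * staleness_weight(version - used) * grad
    version += 1                        # each applied update bumps the version
print(params)
```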

### Job Scheduling

* PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00032)] \[[Code](https://github.com/hal-uw/blox-pal)]
  * UW-Madison
  * Characterize which applications are more likely to suffer from performance variability; balance variability awareness against locality so that jobs are still placed on as few nodes as possible (a toy placement scorer follows).
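
A toy variability-aware placement scorer: among candidate node sets large enough for the job, prefer low predicted variability while penalizing placements that span more nodes. The classes and weights are invented, not PAL's.

```python
import math
from itertools import combinations

# Predicted per-node variability (lower = steadier), e.g. from profiling.
VAR_CLASS = {"n0": 0.05, "n1": 0.30, "n2": 0.07, "n3": 0.10, "n4": 0.45}

def score(nodes):
    variability = max(VAR_CLASS[n] for n in nodes)   # stragglers dominate
    locality_penalty = 0.05 * (len(nodes) - 1)       # prefer fewer nodes
    return variability + locality_penalty

def place(job_gpus, gpus_per_node=4):
    need = math.ceil(job_gpus / gpus_per_node)
    return min(combinations(VAR_CLASS, need), key=score)

print(place(job_gpus=8))   # picks the two steadiest nodes: ('n0', 'n2')
```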

### Distributed Training

* Optimizing Distributed ML Communication with Fused Computation-Collective Operations \[[Paper](https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00094)]
  * AMD
  * Develop three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the communication overheads in DLRM, Transformer, and MoE architectures; the compute/communication overlap they exploit is sketched below.
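
The benefit of fusing computation with collectives is pipelining. A sketch of that overlap pattern: a chunked GEMM whose per-chunk "allreduce" runs on a separate thread while the next chunk computes. The stand-in collective just doubles values, as if two ranks held identical inputs.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def allreduce(chunk):            # stand-in for the collective
    return chunk * 2.0           # pretend two ranks hold identical data

def fused_gemm_allreduce(A, B, n_chunks=4):
    """Overlap GEMM on chunk i with the collective on chunk i-1, mimicking
    the pipelining a fused GEMM + AllReduce operator provides."""
    rows = np.array_split(np.arange(A.shape[0]), n_chunks)
    out = [None] * n_chunks
    with ThreadPoolExecutor(max_workers=1) as comm:
        pending = None
        for i, r in enumerate(rows):
            part = A[r] @ B                      # compute chunk i
            if pending is not None:              # comm on chunk i-1 overlaps
                j, fut = pending
                out[j] = fut.result()
            pending = (i, comm.submit(allreduce, part))
        j, fut = pending
        out[j] = fut.result()
    return np.vstack(out)

A, B = np.random.rand(256, 64), np.random.rand(64, 32)
assert np.allclose(fused_gemm_allreduce(A, B), (A @ B) * 2.0)
```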

### Serverless Computing

* SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00044)] \[[Code](https://github.com/blinkbear/smiless-ad)]
  * SIAT, CAS & UMacau
  * Integrate adaptive pre-warming windows to hide cold starts; built on top of OpenFaaS (a window-sizing sketch follows).
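
A toy pre-warming window: start warming each DAG stage at its predicted invocation time minus its cold-start latency, so containers become ready just in time. The DAG, latencies, and zero safety margin are invented, not SMIless's estimator.

```python
# Warm each stage of an inference DAG just before the request reaches it.
COLD_START = {"preprocess": 0.8, "detect": 3.5, "postprocess": 0.6}   # seconds
STAGE_LATENCY = {"preprocess": 0.2, "detect": 1.1, "postprocess": 0.3}
DAG = ["preprocess", "detect", "postprocess"]

def prewarm_plan(t_request):
    """Return (stage, time to start pre-warming) pairs."""
    plan, t = [], t_request
    for stage in DAG:
        plan.append((stage, t - COLD_START[stage]))   # ready exactly at t
        t += STAGE_LATENCY[stage]                     # when the next stage starts
    return plan

for stage, t in prewarm_plan(t_request=100.0):
    print(f"pre-warm {stage} at t = {t:.1f}s")
```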

### GPU Sharing

* ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00048)] \[[Code](https://github.com/MunQ-Lee/ParvaGPU_SC24)]
  * Chung-Ang University & Electronics and Telecommunications Research Institute & Virginia Tech
  * Integrate MIG and MPS to enhance GPU utilization (a toy slice-selection planner is sketched below).
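
A toy planner in this spirit: give each model the smallest MIG slice whose profiled throughput covers its demand (MIG provides the hard isolation); the reported leftover fraction is headroom that MPS could fill by co-locating another model in the same slice. The slice table, profiles, and greedy rule are all invented, not ParvaGPU's algorithm.

```python
SLICE_UNITS = {"1g.10gb": 1, "2g.20gb": 2, "3g.40gb": 3, "7g.80gb": 7}
PROFILE = {  # profiled requests/s per (model, slice) -- made-up numbers
    "resnet": {"1g.10gb": 200, "2g.20gb": 380, "3g.40gb": 560, "7g.80gb": 1200},
    "bert":   {"1g.10gb": 60,  "2g.20gb": 120, "3g.40gb": 180, "7g.80gb": 400},
}

def plan(demands):
    placement = []
    for model, need in demands.items():
        for s in sorted(SLICE_UNITS, key=SLICE_UNITS.get):
            if PROFILE[model][s] >= need:   # smallest slice meeting demand
                placement.append((model, s, round(need / PROFILE[model][s], 2)))
                break
        else:   # demand exceeds even the largest slice; would need replicas
            placement.append((model, "7g.80gb", 1.0))
    return placement

print(plan({"resnet": 500, "bert": 100}))
# -> [('resnet', '3g.40gb', 0.89), ('bert', '2g.20gb', 0.83)]
```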

### Performance Analysis

* GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00063)] \[[Code](https://zenodo.org/records/10975567)]
  * Beihang University
  * Employ *static analysis* to identify the performance-critical parameters of kernel functions; segment the program execution at external library calls and asynchronous kernel operations; construct a state transfer graph and estimate the workload of each program segment (a loose sketch of the segmentation follows).
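
A loose sketch of the segmentation idea: split a trace at external-library-call boundaries, estimate each segment's workload from kernel launch parameters, and record transitions between segment states. The event schema and workload model are invented; GVARP obtains the critical parameters via static analysis rather than from a hand-written table.

```python
from collections import defaultdict

# Toy trace: kernels carry launch parameters; library calls delimit segments.
trace = [
    ("kernel", "gemm", {"m": 4096, "n": 4096, "k": 4096}),
    ("lib", "MPI_Allreduce", {}),
    ("kernel", "gemm", {"m": 4096, "n": 4096, "k": 4096}),
    ("kernel", "scale", {"n": 1 << 20}),
    ("lib", "MPI_Allreduce", {}),
]

def kernel_work(name, p):                  # stand-in workload estimator
    return 2 * p["m"] * p["n"] * p["k"] if name == "gemm" else p["n"]

states, edges, prev, cur = {}, defaultdict(int), "START", []
for kind, name, p in trace:
    if kind == "lib":                      # segment boundary
        key = tuple(n for _, n, _ in cur)  # segment identity = kernel sequence
        states[key] = sum(kernel_work(n, pp) for _, n, pp in cur)
        edges[(prev, key)] += 1            # state transfer graph edge
        prev, cur = key, []
    else:
        cur.append((kind, name, p))

print("estimated workload per state:", states)
print("state transitions:", dict(edges))
```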

### Interconnects

* Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00039)] \[[Benchmark](https://zenodo.org/records/13312325)]
  * Sapienza University of Rome & University of Trento & Vrije Universiteit Amsterdam & ETH & CINECA & University of Antwerp & HPE & NVIDIA
  * Characterize the GPU interconnects of three supercomputers: Alps, Leonardo, and LUMI (a minimal ping-pong bandwidth probe is sketched below).
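
A minimal example of the kind of microbenchmark behind such a characterization: a two-rank GPU-to-GPU ping-pong over torch.distributed with the NCCL backend. The paper's suite is far broader, also covering collectives and multi-node paths.

```python
# Launch with: torchrun --nproc_per_node=2 pingpong.py
import time
import torch
import torch.distributed as dist

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)
buf = torch.ones(256 * 1024 * 1024 // 4, device="cuda")   # 256 MiB payload

def round_trip():
    (dist.send if rank == 0 else dist.recv)(buf, 1 - rank)  # ping
    (dist.recv if rank == 0 else dist.send)(buf, 1 - rank)  # pong

for _ in range(5):                       # warm-up round trips
    round_trip()
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    round_trip()
torch.cuda.synchronize()
dt = time.perf_counter() - t0

if rank == 0:
    moved = buf.numel() * buf.element_size() * 2 * iters    # bytes, both legs
    print(f"p2p bandwidth ≈ {moved / dt / 2**30:.1f} GiB/s")
dist.destroy_process_group()
```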

## Acronyms

* LLM: Large Language Model
* MoE: Mixture-of-Experts
* DLRM: Deep Learning Recommendation Model
* PEFT: Parameter-Efficient Fine-Tuning
* MIG: Multi-Instance GPU
* MPS: Multi-Process Service
* CXL: Compute Express Link
