# SC 2024

## Meta Info

Homepage: <https://sc24.conference-program.com>

Paper list: <https://dl.acm.org/doi/proceedings/10.5555/3703596>

## Papers

### AI Infrastructure

* Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00089)] \[[HAI Platform Code](https://github.com/HFAiLab/hai-platform)]
  * DeepSeek AI
  * Covers network co-design, HFReduce (a collective communication library), HaiScale (optimized parallelism methods), the 3FS distributed file system, and the HAI Platform (task scheduling and fault tolerance).

### Large Language Models (LLMs)

* LLM inference
  * PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00046)] \[[Code](https://github.com/AutonomicPerfectionist/PipeInfer)]
    * Iowa State University & TU Darmstadt
    * *Continuous Asynchronous Speculation*: run single-token inference simultaneously with several speculative runs.
    * *Early Inference Cancellation*: skip the computation of invalidated runs.
  * LLM-Pilot: Characterize and Optimize Performance of your LLM Inference Services \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00022)] \[[Benchmark](https://github.com/fmperf-project/fmperf)] \[[Code](https://github.com/IBM/LLM-performance-prediction)]
    * IBM Research
    * Learn a predictive model to recommend the most cost-effective hardware for a previously unseen LLM.
* LLM fine-tuning
  * Long Exposure: Accelerating Parameter-Efficient Fine-Tuning for LLMs under Shadowy Sparsity \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00081)] \[[Code](https://github.com/HPHEX/LongExposure)]
    * MSRA & THU
* LLM for anomaly detection
  * Large Language Models for Anomaly Detection in Computational Workflows: From Supervised Fine-Tuning to In-Context Learning \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00098)] \[[Code](https://github.com/PoSeiDon-Workflows/LLM_AD)] \[[Benchmark](https://github.com/PoSeiDon-Workflows/FlowBench)]
    * Argonne National Laboratory & USC & Oak Ridge National Laboratory
    * Investigate two approaches: (1) supervised fine-tuning, where pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies; (2) in-context learning, where prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning.
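
The in-context learning approach in the anomaly-detection entry above amounts to assembling a prompt from a task description and a few labeled examples, with no change to model weights. The sketch below only illustrates that idea: the function name `build_few_shot_prompt`, the prompt wording, and the job records are hypothetical and are not taken from the paper or its code.

```python
def build_few_shot_prompt(task_description, examples, query):
    """Assemble a few-shot prompt: task description, labeled examples, then the query."""
    lines = [task_description, ""]
    for text, label in examples:
        lines.append(f"Job: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The final label is left blank for the LLM to complete.
    lines.append(f"Job: {query}")
    lines.append("Label:")
    return "\n".join(lines)


# Hypothetical workflow job records, made up for illustration only.
examples = [
    ("runtime=12s, cpu_time=11s, bytes_read=3MB", "normal"),
    ("runtime=340s, cpu_time=15s, bytes_read=3MB", "anomalous"),
]
prompt = build_few_shot_prompt(
    "Classify each workflow job as normal or anomalous.",
    examples,
    "runtime=14s, cpu_time=12s, bytes_read=4MB",
)
print(prompt)
```

The resulting string would be sent to an LLM as-is; the supervised fine-tuning alternative would instead train a classification head on the same labeled records.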

### Mixture-of-Experts (MoEs)

* APTMoE: Affinity-Aware Pipeline Tuning for MoE Models on Bandwidth-Constrained GPU Nodes \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00096)] \[[Code](https://github.com/Atopos-309/APTMoE)]
  * SYSU

### Deep Learning Recommendation Models (DLRMs)

* Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00055)] \[[Code](https://doi.org/10.5281/zenodo.13324403)]
  * WHU & NVIDIA & UMacau
  * **EcoRec**: eliminate redundancy in TT (Tensor-Train) operations; micro-batching with sorted indices to reduce memory.
* Accelerating Communication in Deep Learning Recommendation Model Training with Dual-Level Adaptive Lossy Compression \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00095)] \[[Code](https://zenodo.org/records/13119689)]
  * Indiana University, Bloomington & Meta & University of Rochester & ICT, CAS
  * Analyze embedding data features in depth; employ error-bounded lossy compression to reduce the communication data size.
* Efficient Tensor Offloading for Large Deep-Learning Model Training based on Compute Express Link \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00100)] \[[Code](https://github.com/luckyq/ADSC-24)]
  * UC Merced & SK Hynix
  * **TECO**: Tensor-CXL-Offload
  * Introduce a cache-coherent interconnect based on CXL to build a cache coherence domain between CPU memory and accelerator memory; offload tensors to CPU memory to save accelerator memory.
* RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules \[[Paper](https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00047)] \[[Code](https://github.com/PanZaifeng/RecFlex)]
  * RUC & Microsoft & UCSD
  * Create fused kernels with distinct schedules for *different* feature fields.
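
To make "error-bounded lossy compression" from the communication-compression entry above concrete, here is a minimal sketch of the simplest scheme with that property: a uniform scalar quantizer with bin width `2 * error_bound`. This is an assumption chosen for illustration, not the paper's dual-level adaptive method; the function names and the sample embedding row are hypothetical.

```python
def compress(values, error_bound):
    """Quantize each float to an integer bin index; bins are 2 * error_bound wide."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def decompress(codes, error_bound):
    """Map each bin index back to the center of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


# Hypothetical embedding row and error bound, for illustration only.
embedding_row = [0.1234, -0.5678, 0.0021, 0.9999]
eb = 0.01

codes = compress(embedding_row, eb)
restored = decompress(codes, eb)

# Every restored value lies within the error bound of its original.
max_error = max(abs(a - b) for a, b in zip(embedding_row, restored))
print(codes, max_error <= eb)
```

The small integer codes are what would be sent over the network (typically after a further entropy-coding step), and the receiver reconstructs values whose deviation from the originals never exceeds the chosen bound.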

### Graph Transformer

* TorchGT: A Holistic System for Large-Scale Graph Transformer Training \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00083)] \[[Code](https://github.com/zxmeng98/torchgt)]
  * NTU & Shanghai AI Lab & ZJU & SenseTime

### Reinforcement Learning (RL)

* Stellaris: Staleness-Aware Distributed Reinforcement Learning with Serverless Computing \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00045)] \[[Code](https://github.com/IntelliSys-Lab/Stellaris-SC24)]
  * Stevens Institute of Technology & NEU & Stony Brook University & Missouri University of Science and Technology
  * Introduce a generic asynchronous learning paradigm.

### Job Scheduling

* PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00032)] \[[Code](https://github.com/hal-uw/blox-pal)]
  * UW-Madison
  * Characterize which applications are more likely to suffer from performance variability; balance performance variability with locality to ensure jobs are spread across as few nodes as possible.

### Distributed Training

* Optimizing Distributed ML Communication with Fused Computation-Collective Operations \[[Paper](https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00094)]
  * AMD
  * Develop three prototype fused operators (embedding + All-to-All, GEMV + AllReduce, and GEMM + All-to-All) to address the communication overheads in DLRM, Transformer, and MoE model architectures.

### Serverless Computing

* SMIless: Serving DAG-based Inference with Dynamic Invocations under Serverless Computing \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00044)] \[[Code](https://github.com/blinkbear/smiless-ad)]
  * SIAT, CAS & UMacau
  * Integrate adaptive pre-warming windows; built on top of OpenFaaS.

### GPU Sharing

* ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00048)] \[[Code](https://github.com/MunQ-Lee/ParvaGPU_SC24)]
  * Chung-Ang University & Electronics and Telecommunications Research Institute & Virginia Tech
  * Integrate MIG and MPS to enhance GPU utilization.

### Performance Analysis

* GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00063)] \[[Code](https://zenodo.org/records/10975567)]
  * Beihang University
  * Employ *static analysis* to identify the performance-critical parameters of kernel functions; segment the program execution with external library calls and asynchronous kernel operations; construct a state transfer graph and estimate the workload of each program segment.

### Interconnects

* Exploring GPU-to-GPU Communication: Insights into Supercomputer Interconnects \[[Paper](https://dl.acm.org/doi/10.1109/SC41406.2024.00039)] \[[Benchmark](https://zenodo.org/records/13312325)]
  * Sapienza University of Rome & University of Trento & Vrije Universiteit Amsterdam & ETH & CINECA & University of Antwerp & HPE & NVIDIA
  * Characterize three supercomputers: Alps, Leonardo, and LUMI.

## Acronyms

* LLM: Large Language Model
* MoE: Mixture-of-Experts
* DLRM: Deep Learning Recommendation Model
* PEFT: Parameter-Efficient Fine-Tuning
* MIG: Multi-Instance GPU
* MPS: Multi-Process Service
* CXL: Compute Express Link

