Deep Learning Training

Elastic Training

  • EasyScale: Elastic Training with Consistent Accuracy and Improved Utilization on GPUs (SC 2023) [Paper] [Code]

    • BUAA & Alibaba

Parallelism

  • Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency (SC 2023) [Paper] [Code]

    • NUS

  • Supporting Very Large Models using Automatic Dataflow Graph Partitioning (EuroSys 2019) [Paper]

    • NYU

    • Tofu: automatically partitions a dataflow graph of fine-grained tensor operations.

  • One weird trick for parallelizing convolutional neural networks (arXiv 1404.5997) [Paper]

    • Google

    • Data parallelism for convolutional layers; model parallelism for fully-connected layers (see the sketch below).
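
A minimal PyTorch sketch of this hybrid scheme, assuming two visible GPUs: the convolutional trunk is replicated with `nn.DataParallel` (data parallelism), while a toy column-split linear layer spreads the fully-connected head across both devices (model parallelism). Module names and shapes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Toy model parallelism: split the output columns of a linear layer across GPUs."""
    def __init__(self, in_features, out_features, devices):
        super().__init__()
        self.devices = devices
        shard = out_features // len(devices)      # assume it divides evenly
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard).to(d) for d in devices
        )

    def forward(self, x):
        # Broadcast activations to every GPU, compute partial outputs,
        # and gather the column shards back on the first device.
        outs = [s(x.to(d)) for s, d in zip(self.shards, self.devices)]
        return torch.cat([o.to(self.devices[0]) for o in outs], dim=1)

if torch.cuda.device_count() >= 2:
    devices = [torch.device("cuda:0"), torch.device("cuda:1")]
    conv = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4)
    ).to(devices[0])
    conv = nn.DataParallel(conv, device_ids=[0, 1])          # data parallelism
    head = ColumnParallelLinear(32 * 4 * 4, 1000, devices)   # model parallelism

    x = torch.randn(64, 3, 224, 224, device=devices[0])
    logits = head(conv(x).flatten(1))
    print(logits.shape)   # torch.Size([64, 1000])
```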

Optimizing Network Communication

  • A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (OSDI 2020) [Personal Notes] [Paper] [Code]

    • THU & ByteDance

    • BytePS: a communication framework for distributed DNN training

    • Leverage spare CPU and bandwidth resources (see the sketch below)

    • Consider network topology
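
A toy, single-process sketch of the push/pull pattern behind this idea: workers push per-layer gradients to a CPU-side summation service that averages them, standing in for spare CPU and bandwidth resources. The class and function names here are made up for illustration and do not reflect BytePS's actual API or architecture.

```python
import threading
import numpy as np

class CpuSummationService:
    """Toy CPU-side aggregator: workers push gradients, then pull the average."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.lock = threading.Lock()
        self.accumulator = None
        self.pushed = 0
        self.done = threading.Event()

    def push(self, grad):
        with self.lock:
            if self.accumulator is None:
                self.accumulator = grad.copy()
            else:
                self.accumulator += grad
            self.pushed += 1
            if self.pushed == self.num_workers:
                self.accumulator /= self.num_workers
                self.done.set()

    def pull(self):
        self.done.wait()               # block until every worker has pushed
        return self.accumulator

def worker(rank, service, results):
    grad = np.full(4, float(rank))     # stand-in for a locally computed gradient
    service.push(grad)                 # push to the CPU aggregator
    results[rank] = service.pull()     # pull the averaged gradient back

num_workers = 4
service = CpuSummationService(num_workers)
results = [None] * num_workers
threads = [threading.Thread(target=worker, args=(r, service, results))
           for r in range(num_workers)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results[0])   # [1.5 1.5 1.5 1.5], the mean of ranks 0..3
```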

Reduce GPU Memory Footprints

GPU Sharing

  • Zico: Efficient GPU Memory Sharing for Concurrent DNN Training (ATC 2021) [Personal Notes] [Paper]

    • UNIST & Ajou & Alibaba & KAIST

    • Reduce the overall GPU memory consumption of co-located DNN training jobs

    • Utilize NVIDIA MPS

  • Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications (MLSys 2020) [Paper] [Code]

    • UMich SymbioticLab

    • Fine-grained GPU sharing; customized TensorFlow.

  • Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 2018) [Paper]

    • MSRA

    • Time slicing; suspend and resume at mini-batch granularity (see the sketch below).
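
A minimal PyTorch sketch of suspend-and-resume at mini-batch granularity: when a (simulated) scheduler signal arrives between iterations, model and optimizer state are copied to host memory and the GPU cache is released, then restored when the job is resumed. The `should_suspend`, `suspend`, and `resume` helpers are hypothetical, not Gandiva's implementation.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(128, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

def suspend(model, opt):
    """Move state to host memory and release the GPU between mini-batches."""
    state = {"model": {k: v.cpu() for k, v in model.state_dict().items()},
             "opt": opt.state_dict()}
    model.cpu()
    if device == "cuda":
        torch.cuda.empty_cache()       # return cached GPU memory to the driver
    return state

def resume(state, model, opt):
    """Restore host-side state and move the model back onto the GPU."""
    model.load_state_dict(state["model"])
    model.to(device)
    opt.load_state_dict(state["opt"])

def should_suspend(step):
    return step == 2                   # pretend the scheduler preempts us here

for step in range(5):
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

    if should_suspend(step):           # only at mini-batch boundaries, where
        state = suspend(model, opt)    # GPU state is minimal and cheap to save
        resume(state, model, opt)      # ...later, when the GPU is granted again
```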

Tensor Swapping / Recomputation

  • SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping (ASPLOS 2020) [Paper]

    • NYU

    • Tensor swapping

    • Consider both GPU memory allocation and operator scheduling

  • Capuchin: Tensor-based GPU Memory Management for Deep Learning (ASPLOS 2020) [Paper]

    • HUST & MSRA & USC

    • Combination of tensor swapping and recomputation (see the sketch at the end of this list).

  • Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization (MLSys 2020) [Paper] [Code]

    • UC Berkeley

    • Formulate tensor rematerialization as a constrained optimization problem.

  • SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks (PPoPP 2018) [Paper]

    • Brown & UESTC & Los Alamos National Laboratory & Pacific Northwest National Laboratory & MIT

    • Cost-aware recomputation

    • Free tensors that are cheap to recompute (e.g., activations of low-cost, non-convolutional layers) and rebuild them during the backward pass

  • vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design (MICRO 2016) [Paper]

    • NVIDIA

    • Proactively offload and prefetch tensors so that CPU-GPU transfers overlap with GPU computation.

  • Training Deep Nets with Sublinear Memory Cost (arXiv 1604.06174) [Personal Notes] [Paper] [Code]

    • UW & Dato Inc. & MIT

    • Memory Monger

    • Sublinear memory cost; trade computation for memory (see the sketch below).
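
Both ideas in this list can be sketched with stock PyTorch features: `torch.utils.checkpoint` drops intermediate activations and recomputes them during the backward pass (recomputation / sublinear memory cost), while `torch.autograd.graph.save_on_cpu` offloads saved activations to host memory and copies them back on demand (swapping). A minimal sketch of the general techniques, not of any one paper's system.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint
from torch.autograd.graph import save_on_cpu

device = "cuda" if torch.cuda.is_available() else "cpu"

trunk = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU())
                        for _ in range(8)]).to(device)
head = nn.Linear(1024, 10).to(device)
x = torch.randn(64, 1024, device=device)
y = torch.randint(0, 10, (64,), device=device)

# 1) Recomputation: do not keep the trunk's intermediate activations;
#    recompute them during backward (trades computation for memory).
h = checkpoint(trunk, x, use_reentrant=False)

# 2) Swapping: activations saved for backward inside this context are
#    offloaded to (pinned) host memory and fetched back when needed.
with save_on_cpu(pin_memory=(device == "cuda")):
    logits = head(h)

loss = nn.functional.cross_entropy(logits, y)
loss.backward()
print(loss.item())
```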

Compression

  • Echo: Compiler-based GPU Memory Footprint Reduction for LSTM RNN Training (ISCA 2020) [Paper]

    • UofT

    • LSTM RNN training

  • Gist: Efficient Data Encoding for Deep Neural Network Training (ISCA 2018) [Paper]

    • MSR & UMich & UofT

    • Layer-specific data encodings of stashed feature maps (see the sketch below)
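
A toy illustration of the data-encoding idea: when a ReLU output is stashed only so that the backward pass knows which gradients to zero, a packed 1-bit mask of the positive entries suffices instead of the full fp32 feature map. A simplified NumPy sketch in the spirit of Gist's lossless encodings, not the paper's implementation.

```python
import numpy as np

def encode_relu_output(activation):
    """Stash a ReLU output as a packed 1-bit mask (32x smaller than fp32)."""
    mask = activation > 0                        # entries that passed the ReLU
    return np.packbits(mask), activation.shape

def relu_backward(grad_output, packed_mask, shape):
    """Recover the mask and zero the gradient where the ReLU was inactive."""
    mask = np.unpackbits(packed_mask, count=int(np.prod(shape)))
    mask = mask.reshape(shape).astype(bool)
    return grad_output * mask

activation = np.maximum(np.random.randn(256, 1024).astype(np.float32), 0.0)
packed, shape = encode_relu_output(activation)
print(activation.nbytes, "bytes stashed as", packed.nbytes, "bytes")

grad_out = np.random.randn(*shape).astype(np.float32)
grad_in = relu_backward(grad_out, packed, shape)
assert np.allclose(grad_in, grad_out * (activation > 0))
```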
