Deep Learning Training
Elastic Training
Parallelism
Supporting Very Large Models using Automatic Dataflow Graph Partitioning (EuroSys 2019) [Paper]
NYU
Tofu: automatically partitions a dataflow graph of fine-grained tensor operations.
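To make the idea of operator-level partitioning concrete, here is a minimal NumPy sketch (illustrative only, not Tofu's actual system or API): a single matrix multiplication is partitioned along its output dimension across two hypothetical workers, and the stitched result matches the unpartitioned operator.

```python
import numpy as np

# Toy illustration of operator-level partitioning (not Tofu's API):
# split a matmul Y = X @ W along W's output (column) dimension,
# so each "worker" holds one shard of W and produces one shard of Y.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))   # activations, replicated on both workers
W = rng.standard_normal((16, 32))  # weights, partitioned column-wise

W_shards = np.split(W, 2, axis=1)          # each worker owns 16 output columns
Y_shards = [X @ w for w in W_shards]       # each worker computes its output shard locally
Y = np.concatenate(Y_shards, axis=1)       # gather along the partitioned dimension

assert np.allclose(Y, X @ W)               # partitioned result matches the full operator
```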
One weird trick for parallelizing convolutional neural networks (arXiv 1404.5997) [Paper]
Google
Data parallelism for convolutional layers; model parallelism for fully-connected layers.
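As a rough single-process illustration of that hybrid scheme (a sketch, not Krizhevsky's implementation), the convolutional "workers" below each process a slice of the batch with replicated weights, while the fully-connected weights are split across workers and every worker consumes the full batch of features.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)   # replicated on every worker (data parallel)
fc_full = nn.Linear(8 * 4 * 4, 6, bias=False)      # logically one layer, split for model parallelism

x = torch.randn(4, 3, 4, 4)                        # global batch of 4 images

# Data parallelism for the convolutional part: each worker gets half the batch.
feats = [conv(shard).flatten(1) for shard in x.chunk(2, dim=0)]
feats = torch.cat(feats, dim=0)                    # gather features from both workers

# Model parallelism for the fully-connected part: split the output units
# across workers; every worker sees the full feature batch.
w_shards = fc_full.weight.chunk(2, dim=0)
logits = torch.cat([feats @ w.t() for w in w_shards], dim=1)

assert torch.allclose(logits, fc_full(feats))      # same result as the unpartitioned layer
```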
Optimizing Network Communication
A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters (OSDI 2020) [Personal Notes] [Paper] [Code]
THU & ByteDance
BytePS: a unified communication framework for distributed DNN training.
Leverages spare CPU and bandwidth resources in the cluster.
Takes the network topology into account.
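The core idea, gradients being partitioned and pushed to summation services that can run on otherwise idle CPU machines, can be simulated in a few lines of plain NumPy. The sketch below uses hypothetical names and no real networking; it is not the BytePS API.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_WORKERS, NUM_SERVERS = 4, 2

# Each GPU worker produces a gradient for the same parameter tensor.
grads = [rng.standard_normal(8) for _ in range(NUM_WORKERS)]

# Partition every gradient into NUM_SERVERS chunks; chunk i is "pushed" to
# summation service i (which could be placed on a spare CPU machine).
def push(grad):
    return np.array_split(grad, NUM_SERVERS)

server_buffers = [np.zeros_like(chunk) for chunk in push(grads[0])]
for g in grads:
    for i, chunk in enumerate(push(g)):
        server_buffers[i] += chunk          # CPU-side summation service

# Workers then "pull" the aggregated chunks and reassemble the full gradient.
aggregated = np.concatenate(server_buffers)
assert np.allclose(aggregated, np.sum(grads, axis=0))
```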
Reduce GPU Memory Footprints
GPU Sharing
Zico: Efficient GPU Memory Sharing for Concurrent DNN Training (ATC 2021) [Personal Notes] [Paper]
UNIST & Ajou & Alibaba & KAIST
Reduces the overall GPU memory consumption of co-located DNN training jobs.
Utilizes NVIDIA MPS for concurrent execution.
Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications (MLSys 2020) [Paper] [Code]
UMich SymbioticLab
Fine-grained GPU sharing primitives (fast job switching and memory sharing), implemented on a customized TensorFlow.
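One way to picture iteration-granularity job switching is the toy scheduler below, which interleaves two training loops (written as Python generators) at iteration boundaries. This is only a conceptual illustration, not Salus's interface.

```python
import itertools

def training_job(name, num_iters):
    """A training loop that yields control back to the scheduler after each iteration."""
    for step in range(num_iters):
        # ... forward/backward/update for one iteration would run here ...
        yield f"{name}: finished iteration {step}"

# Fine-grained (iteration-level) time-sharing of one GPU between two jobs:
# the scheduler switches jobs at every iteration boundary instead of
# dedicating the device to one job until it completes.
jobs = [training_job("job-A", 3), training_job("job-B", 3)]
for event in itertools.chain.from_iterable(itertools.zip_longest(*jobs)):
    if event is not None:
        print(event)
```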
Tensor Swapping / Recomputation
SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping (ASPLOS 2020) [Paper]
NYU
Tensor swapping
Jointly considers GPU memory allocation and operator scheduling when planning swaps.
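For a flavor of activation swapping in general (not SwapAdvisor's planner), recent PyTorch exposes `torch.autograd.graph.save_on_cpu`, which offloads activations saved for backward to host memory during the forward pass and copies them back on demand during the backward pass:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
x = torch.randn(64, 512, device=device)

# Activations saved for backward are moved to (pinned) CPU memory during the
# forward pass and brought back to the GPU when the backward pass needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=torch.cuda.is_available()):
    loss = model(x).sum()
loss.backward()
```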
Capuchin: Tensor-based GPU Memory Management for Deep Learning (ASPLOS 2020) [Paper]
HUST & MSRA & USC
Combination of tensor swapping and recomputation.
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization (MLSys 2020) [Paper] [Code]
UC Berkeley
Formulates tensor rematerialization (recomputation) as a constrained optimization problem solved with a mixed-integer linear program.
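The brute-force toy below conveys the same idea at a tiny scale: choose which activations of a layer chain to keep so that stored-activation memory stays under a budget while the extra recomputation cost is minimized. The cost and memory model is simplified for illustration and is not Checkmate's formulation or API.

```python
from itertools import combinations

# Toy chain of layers: per-activation memory footprint and forward compute cost.
mem  = [4, 2, 6, 3, 5]    # arbitrary units
comp = [1, 3, 2, 4, 2]
BUDGET = 10               # memory budget for stored activations

def recompute_cost(kept):
    """Cost model: a dropped activation is rebuilt by replaying the forward
    pass from the nearest kept activation (or the input) up to it."""
    total = 0
    for i in range(len(mem)):
        if i in kept:
            continue
        start = max((k for k in kept if k < i), default=-1) + 1
        total += sum(comp[start:i + 1])
    return total

best = None
for r in range(len(mem) + 1):
    for kept in map(set, combinations(range(len(mem)), r)):
        if sum(mem[i] for i in kept) <= BUDGET:
            cost = recompute_cost(kept)
            if best is None or cost < best[0]:
                best = (cost, kept)

print(f"keep activations {sorted(best[1])}, extra recompute cost = {best[0]}")
```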
SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks (PPoPP 2018) [Paper]
Brown & UESTC & Los Alamos National Laboratory & Pacific Northwest National Laboratory & MIT
Cost-aware recomputation
Frees and later recomputes tensors whose layers have low recomputation cost, keeping expensive-to-recompute tensors (e.g., convolution outputs) resident.
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design (MICRO 2016) [Paper]
NVIDIA
Proactively swaps feature maps out to CPU memory and prefetches them back, overlapping the CPU-GPU transfers with GPU computation.
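A common way to obtain that kind of overlap in practice is an asynchronous copy into a pinned host buffer on a side CUDA stream. The PyTorch sketch below (an illustration, not vDNN's cuDNN-level implementation) offloads one layer's activation while the next layer keeps computing on the default stream.

```python
import torch
import torch.nn as nn

if torch.cuda.is_available():                 # the overlap only exists with a real GPU
    dev = torch.device("cuda")
    layer1 = nn.Linear(4096, 4096).to(dev)
    layer2 = nn.Linear(4096, 4096).to(dev)
    x = torch.randn(1024, 4096, device=dev)

    default_stream = torch.cuda.current_stream()
    copy_stream = torch.cuda.Stream()

    act1 = layer1(x)

    # Launch the device-to-host copy of act1 on a side stream ...
    copy_stream.wait_stream(default_stream)   # wait until act1 has been produced
    with torch.cuda.stream(copy_stream):
        act1_cpu = torch.empty(act1.shape, dtype=act1.dtype, pin_memory=True)
        act1_cpu.copy_(act1, non_blocking=True)

    # ... while the default stream keeps computing the next layer.
    act2 = layer2(act1)

    default_stream.wait_stream(copy_stream)   # act1_cpu is now safe to read
```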
Training Deep Nets with Sublinear Memory Cost (arXiv 1604.06174) [Personal Notes] [Paper] [Code]
UW & Dato Inc. & MIT
Memory Monger
Sublinear (O(sqrt(n))) memory cost via gradient checkpointing; trades extra computation for memory.
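This technique is what PyTorch exposes as gradient checkpointing. The snippet below (a usage sketch against recent PyTorch, not the paper's MXNet Memory Monger code) uses `torch.utils.checkpoint.checkpoint_sequential` to store only segment-boundary activations during the forward pass and recompute the rest during backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep chain of layers; without checkpointing, every intermediate
# activation is kept alive until the backward pass.
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(16)])
x = torch.randn(32, 256, requires_grad=True)

# Keep activations only at the boundaries of 4 segments; everything else is
# recomputed during backward, trading extra compute for reduced memory.
out = checkpoint_sequential(model, 4, x, use_reentrant=False)
out.sum().backward()
```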
Compression