A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters
#communication_framework #parameter_server #all-reduce #RoCEv2 #heterogeneous_environment #distributed_deep_learning_training
Meta Info
Presented at OSDI 2020.
Authors: Yimin Jiang (THU & ByteDance), Yibo Zhu (ByteDance), Chang Lan (Google), Bairen Yi (ByteDance), Yong Cui (THU), Chuanxiong Guo (ByteDance).
Code: https://github.com/bytedance/byteps
Understanding the paper
This paper presents a distributed DNN training architecture called BytePS.
It leverages spare CPU and bandwidth resources in the cluster to accelerate distributed DNN training tasks.
It outperforms state-of-the-art open-source all-reduce and parameter server (PS) implementations.
Observation
There are spare CPUs and bandwidth in production GPU clusters.
Existing all-reduce and PS architectures are suboptimal in such clusters.
Neither makes good use of the spare CPU and bandwidth resources.
Architecture
Two main modules
Summation Service (SS)
Receive tensors from CS; sum them up; send them back to CS.
Communication Service (CS)
(Internally) synchronize the tensors among the local GPUs
(Externally) communicate with SS (see the sketch below)
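The division of labor can be illustrated with a minimal sketch. The class and method names below (push, pull, reduce_and_push, pull_and_broadcast) are made up for illustration and are not the BytePS API; numpy arrays stand in for GPU tensors.

```python
import numpy as np

class SummationService:
    """SS: receive partial sums from each CS, add them up, send the result back."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.pending = {}                              # tensor_id -> received tensors

    def push(self, tensor_id, tensor):
        self.pending.setdefault(tensor_id, []).append(tensor)

    def pull(self, tensor_id):
        # The real SS replies only after every worker has pushed; here we just sum.
        assert len(self.pending[tensor_id]) == self.num_workers
        return np.sum(self.pending.pop(tensor_id), axis=0)

class CommunicationService:
    """CS: reduce across local GPUs, exchange the partial sum with SS, broadcast."""
    def reduce_and_push(self, ss, tensor_id, local_gpu_tensors):
        local_sum = np.sum(local_gpu_tensors, axis=0)  # intra-machine aggregation
        ss.push(tensor_id, local_sum)                  # inter-machine: CS -> SS

    def pull_and_broadcast(self, ss, tensor_id, num_local_gpus):
        global_sum = ss.pull(tensor_id)                # inter-machine: SS -> CS
        return [global_sum.copy() for _ in range(num_local_gpus)]
```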
Communication design
Inter-machine communication
How the summation workload is placed across CPU and GPU machines determines the network traffic, so BytePS chooses the placement to avoid any single link becoming a bottleneck.
Partition the tensors into small parts of at most 4MB each (see the partitioning sketch after this list).
Use RDMA RoCEv2.
Each machine has one 100GbE NIC.
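A small sketch of the 4MB partitioning, assuming float32 numpy tensors; the helper name partition and the flatten-then-chunk approach are illustrative, not BytePS code.

```python
import numpy as np

PART_BYTES = 4 * 1024 * 1024   # 4MB upper bound per partition (from the paper)

def partition(tensor: np.ndarray):
    """Split a flattened tensor into chunks of at most 4MB each."""
    flat = tensor.ravel()
    elems_per_part = PART_BYTES // flat.itemsize
    return [flat[i:i + elems_per_part] for i in range(0, flat.size, elems_per_part)]

# Example: a 10M-float gradient (40MB) becomes 10 partitions of 4MB each.
parts = partition(np.zeros(10 * 1024 * 1024, dtype=np.float32))
assert len(parts) == 10 and all(p.nbytes <= PART_BYTES for p in parts)
```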
Intra-machine communication
PCIe-only topology
CPU-assisted aggregation
Let GPUs under the same PCIe switch sum the tensors.
Copy the partial sums to CPU memory and let the CPU do the global summation.
Broadcast the global sum back to all local GPUs (sketched below).
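A sketch of CPU-assisted aggregation, assuming 8 GPUs split across two PCIe switches; numpy arrays stand in for GPU buffers, and the actual CUDA copies are only indicated in the comments.

```python
import numpy as np

def cpu_assisted_aggregate(gpu_tensors, gpus_per_switch=4):
    # Step 1: GPUs under the same PCIe switch sum their tensors among themselves.
    per_switch_sums = []
    for i in range(0, len(gpu_tensors), gpus_per_switch):
        per_switch_sums.append(np.sum(gpu_tensors[i:i + gpus_per_switch], axis=0))

    # Step 2: each switch's partial sum is copied to CPU memory, where the CPU
    # computes the machine-wide (global) sum.
    global_sum = np.sum(per_switch_sums, axis=0)

    # Step 3: the global sum is broadcast back to every GPU.
    return [global_sum.copy() for _ in gpu_tensors]

# Example: 8 GPUs each holding a vector of ones -> every GPU ends with a vector of 8s.
gpus = [np.ones(1024, dtype=np.float32) for _ in range(8)]
out = cpu_assisted_aggregate(gpus)
assert np.allclose(out[0], 8.0)
```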
NVLink-based topology
Reduce tensors from all GPUs to GPU2.
Copy the result from GPU2 to CPU memory.
After CS gets the aggregated results from SS, GPU2 copies the data back into GPU memory and broadcasts it to the other GPUs.
This prevents the GPUs from using the P0 - CPU0 link for communication, so the NIC can run at its full 100Gbps bandwidth (see the sketch below).
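A sketch of the NVLink-based flow, again with numpy stand-ins for GPU buffers; the push_pull_with_ss callable is a placeholder for the CS/SS exchange sketched earlier, and the choice of GPU2 follows the paper's topology figure.

```python
import numpy as np

def nvlink_aggregate(gpu_tensors, push_pull_with_ss):
    # Step 1: reduce tensors from all local GPUs to GPU2 over NVLink.
    gpu2_local_sum = np.sum(gpu_tensors, axis=0)

    # Step 2: copy the result from GPU2 to CPU memory; CS sends it to SS and
    # receives the globally aggregated result (push_pull_with_ss stands in for
    # that exchange).
    global_sum = push_pull_with_ss(gpu2_local_sum)

    # Step 3: GPU2 copies the global sum back into GPU memory and broadcasts it
    # to the other GPUs, keeping GPU traffic off the P0 - CPU0 link.
    return [global_sum.copy() for _ in gpu_tensors]

# Single-machine example: the SS exchange degenerates to the identity.
gpus = [np.full(4, i, dtype=np.float32) for i in range(8)]
out = nvlink_aggregate(gpus, push_pull_with_ss=lambda local_sum: local_sum)
assert np.allclose(out[0], 28.0)   # 0 + 1 + ... + 7
```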
Two principles
Avoid direct GPU-to-GPU memory copy when the two GPUs are not under the same PCIe switch.
Minimize traffic on the PCIe-switch-to-CPU link that is shared by the GPUs and the NIC.
GPUDirect RDMA (GDR)
Limitation
GDR requires the GPU and the RDMA NIC to be on the same PCIe switch; otherwise, throughput can drop below 50Gbps even with a 100GbE NIC.
The PCIe-only topology cannot meet this requirement, while the NVLink-based topology already avoids any PCIe bottleneck, so GDR brings little benefit there.
Most clouds like AWS do not support GDR.
Result: BytePS does not use GDR for now.
Notes
The CPU machines are not necessarily dedicated CPU-only machines. For example, the authors' in-house cluster scheduler can allocate CPUs on GPU machines that run non-distributed jobs and have spare CPU cores and network bandwidth, which improves overall cluster resource utilization.
The problem is that the intra-machine topology is not symmetric with respect to the NIC, which is connected to only one of the four PCIe switches.