A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

#communication_framework #parameter_server #all-reduce #RoCEv2 #heterogeneous_environment #distributed_deep_learning_training

Meta Info

Presented at OSDI 2020.

Authors: Yimin Jiang (THU & ByteDance), Yibo Zhu (ByteDance), Chang Lan (Google), Bairen Yi (ByteDance), Yong Cui (THU), Chuanxiong Guo (ByteDance).

Code: https://github.com/bytedance/byteps

Understanding the paper

This paper presents a distributed DNN training architecture called BytePS.

It leverages spare CPU and bandwidth resources in the cluster to accelerate distributed DNN training tasks.

It outperforms state-of-the-art open-source all-reduce and parameter server (PS) implementations.

Observation

  • There are spare CPUs and bandwidth in production GPU clusters.

  • Existing all-reduce and PS architectures are insufficient.

    • They do not utilize additional CPU and bandwidth resources well.

Architecture

  • Two main modules

    • Summation Service (SS)

      • Receive tensors from CS; sum them up; send them back to CS (see the sketch after this list).

    • Communication Service (CS)

      • (Internally) synchronize the tensors among the local GPUs

      • (Externally) communicate with SS
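
A minimal sketch of how the two modules interact, using NumPy arrays in place of GPU tensors and direct function calls in place of BytePS's RDMA transport; the class and method names (`push`, `pull`, `sync`) and the hash-based SS selection are illustrative assumptions, not BytePS's actual API.

```python
import numpy as np

class SummationService:
    """SS: receive tensor partitions from CS modules, sum them, hand the sum back."""
    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.sums = {}  # partition key -> (running sum, contributions received)

    def push(self, key, tensor):
        acc, n = self.sums.get(key, (np.zeros_like(tensor), 0))
        self.sums[key] = (acc + tensor, n + 1)

    def pull(self, key):
        acc, n = self.sums[key]
        assert n == self.num_workers, "waiting for more CS pushes"
        return acc


class CommunicationService:
    """CS: aggregate across the local GPUs internally, talk to SS externally."""
    def __init__(self, summation_services):
        self.ss = summation_services

    def sync(self, key, local_gpu_tensors):
        local_sum = np.sum(local_gpu_tensors, axis=0)  # internal aggregation
        target = self.ss[hash(key) % len(self.ss)]     # pick the SS handling this partition
        target.push(key, local_sum)                    # external: push the partial sum
        return target                                  # pull(key) later returns the global sum
```

With one SS and two workers, each worker's CS calls `sync(...)` with its local GPU tensors and later calls `pull(key)` on the returned SS to obtain the global sum, which it then broadcasts back to its local GPUs.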

Communication design

  • Inter-machine communication

    • The placement of the summation workload across SS modules determines the inter-machine network traffic.

    • Partition the tensors into small chunks of at most 4MB each (a partitioning sketch follows this list).

    • Use RDMA RoCEv2.

    • Each machine has one 100GbE NIC.

  • Intra-machine communication

    • PCIe-only topology

      • CPU-assisted aggregation

        • Let GPUs under the same PCIe switch sum the tensors.

        • Copy the partial sums to CPU memory and let the CPU do the global summation.

        • Broadcast the global sum back to the GPUs (a sketch of these steps follows this list).

      <figure><img src="../../../.gitbook/assets/pcie-only-machine-topology-and-byteps-data-flow.png" alt=""><figcaption><p>PCIe-only machine topology and BytePS data flow.</p></figcaption></figure>
    • NVLink-based topology

      • Reduce tensors from all GPUs to GPU2.

      • Copy the reduced result from GPU2 to CPU memory.

      • After CS gets the aggregated results from SS, GPU2 copies the data back into GPU memory and broadcasts it to the other GPUs.

      • This keeps GPU traffic off the P0 - CPU0 link, so the NIC can run at its full 100Gbps bandwidth.

      <figure><img src="../../../.gitbook/assets/nvlink-based-machine-topology-and-byteps-data-flow.png.png" alt=""><figcaption><p>NVLink-based machine topology and BytePS data flow.</p></figcaption></figure>
    • Two principles

      • Avoid direct GPU-to-GPU memory copy when the two GPUs are not under the same PCIe switch.

      • Minimize traffic on the PCIe-switch-to-CPU link that is shared by the GPUs and the NIC.

  • GPUDirect RDMA (GDR)

    • Limitation

      • GDR requires the GPU and the RDMA NIC to be on the same PCIe switch; otherwise, throughput can drop below 50Gbps even with a 100GbE NIC.

      • The PCIe-only topology cannot meet this requirement, while the NVLink-based topology already avoids PCIe bottlenecks.

      • Most cloud platforms, such as AWS, do not support GDR.

    • Result: BytePS does not use GDR for now.
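
To make the 4MB partitioning concrete, here is a minimal sketch, assuming a flat NumPy gradient buffer and a simple round-robin assignment of partitions to SS instances; the 4MB bound comes from the paper, while `partition_tensor` and the placement policy are illustrative assumptions.

```python
import numpy as np

PARTITION_BYTES = 4 * 1024 * 1024  # 4MB upper bound per partition, as in the paper

def partition_tensor(tensor, num_ss):
    """Split a flat gradient tensor into chunks of at most 4MB and assign each to an SS.

    The round-robin assignment below is an illustrative assumption; BytePS balances
    the summation workload across SS instances, but its placement policy is its own.
    """
    flat = tensor.reshape(-1)
    elems_per_chunk = max(1, PARTITION_BYTES // flat.itemsize)
    assignments = []
    for i, start in enumerate(range(0, flat.size, elems_per_chunk)):
        chunk = flat[start:start + elems_per_chunk]
        assignments.append((i % num_ss, chunk))  # (SS index, partition)
    return assignments

# Example: a 100M-parameter fp32 gradient (~400MB) yields 96 partitions of <=4MB.
grads = np.zeros(100_000_000, dtype=np.float32)
parts = partition_tensor(grads, num_ss=4)
print(len(parts), parts[0][1].nbytes)  # 96 4194304
```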
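
Similarly, a minimal sketch of the CPU-assisted aggregation used on PCIe-only machines, with NumPy arrays standing in for GPU buffers and ordinary copies standing in for PCIe transfers; `cpu_assisted_aggregate` and its input layout are hypothetical, but the three steps mirror the bullets above.

```python
import numpy as np

def cpu_assisted_aggregate(tensors_by_switch):
    """CPU-assisted aggregation for a PCIe-only machine.

    tensors_by_switch: list of lists; tensors_by_switch[s][g] is the tensor held by
    GPU g under PCIe switch s (NumPy arrays stand in for GPU buffers).
    """
    # Step 1: GPUs under the same PCIe switch sum their tensors locally,
    # so only one partial sum crosses each switch's uplink.
    partial_sums = [np.sum(np.stack(gpu_tensors), axis=0)
                    for gpu_tensors in tensors_by_switch]

    # Step 2: copy the per-switch partial sums to CPU memory (implicit here)
    # and let the CPU do the global summation.
    global_sum = np.sum(np.stack(partial_sums), axis=0)

    # Step 3: broadcast the global sum back to every GPU.
    return [[global_sum.copy() for _ in gpu_tensors]
            for gpu_tensors in tensors_by_switch]

# Example: 2 PCIe switches x 4 GPUs, each holding a ones-tensor -> global sum is 8s.
result = cpu_assisted_aggregate([[np.ones(3) for _ in range(4)] for _ in range(2)])
print(result[0][0])  # [8. 8. 8.]
```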

Notes

The CPU machines are not necessarily dedicated CPU-only machines. For example, ByteDance's in-house cluster scheduler can allocate CPUs on GPU machines that run non-distributed jobs and have spare CPU cores and network bandwidth, which improves overall cluster resource utilization.

The problem is that the topology is not symmetric with respect to the NIC, which is connected to only one of the four PCIe switches.
