# A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

## Meta Info

Presented in [OSDI 2020](https://www.usenix.org/conference/osdi20/presentation/jiang).

Authors: Yimin Jiang (*THU & ByteDance*), Yibo Zhu (*ByteDance*), Chang Lan (*Google*), Bairen Yi (*ByteDance*), Yong Cui (*THU*), Chuanxiong Guo (*ByteDance*).

Code: <https://github.com/bytedance/byteps>

## Understanding the paper

This paper presents a distributed DNN training architecture called **BytePS**.

It leverages *spare CPU and bandwidth resources* in the cluster to accelerate distributed DNN training tasks.

It outperforms state-of-the-art open-source *all-reduce* and *PS* implementations.

### Observation

* There are *spare CPUs and bandwidth* in production GPU clusters.
* Existing *all-reduce* and *PS* architectures are insufficient.
  * They do not make good use of these spare CPU and bandwidth resources.

### Architecture

* Two main modules
  * *Summation Service (SS)*
    * Receive tensors from CS; sum them up; send them back to CS.
  * *Communication Service (CS)*
    * (Internally) synchronize the tensors among the local GPUs
    * (Externally) communicate with SS

![BytePS architecture.](https://819228986-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MkzeiawY8SkBarQBDVm-659326392%2Fuploads%2F09fat6S4JlOBOmmUmDdr%2Fimage.png?alt=media\&token=4597716b-533a-48b5-9852-32e4145656c1)
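
To make the two roles concrete, here is a minimal sketch of how an SS instance might accumulate tensor chunks pushed by the CS modules and return the sum once every worker has contributed. This is an assumed illustration, not BytePS's actual code; `SummationService`, `on_receive`, and the chunk-id keys are hypothetical names.

```python
# Minimal sketch of a Summation Service (SS) accumulator.
# Illustrative names and logic only, not the actual BytePS implementation.
import numpy as np

class SummationService:
    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.partial_sums = {}   # chunk_id -> running sum
        self.recv_counts = {}    # chunk_id -> number of workers received so far

    def on_receive(self, chunk_id, tensor_chunk: np.ndarray):
        """Called when a CS pushes a tensor chunk; returns the aggregated
        chunk once all workers have contributed, otherwise None."""
        if chunk_id not in self.partial_sums:
            self.partial_sums[chunk_id] = np.zeros_like(tensor_chunk)
            self.recv_counts[chunk_id] = 0
        # SS only sums; the optimizer update stays on the GPU workers.
        self.partial_sums[chunk_id] += tensor_chunk
        self.recv_counts[chunk_id] += 1
        if self.recv_counts[chunk_id] == self.num_workers:
            self.recv_counts.pop(chunk_id)
            return self.partial_sums.pop(chunk_id)   # sent back to every CS
        return None

ss = SummationService(num_workers=2)
chunk = np.ones(4, dtype=np.float32)
assert ss.on_receive("layer0:0", chunk) is None          # first worker arrives
assert np.allclose(ss.on_receive("layer0:0", chunk), 2)  # second completes the sum
```

Keeping SS to pure summation is what allows it to run efficiently on CPU machines, while CS handles all intra-machine synchronization and the push/pull with SS.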

### Communication design

* Inter-machine communication
  * How the summation workload is placed across CPU and GPU machines determines the *network traffic*.
  * Partition the tensors into small chunks no larger than 4MB (see the partitioning sketch after this list).
  * Use RDMA RoCEv2.
  * Each machine has *one 100GbE NIC*.
* Intra-machine communication
  * PCIe-only topology
    * *CPU-assisted aggregation* (see the simulation sketch after this list)
      * Let GPUs under the same PCIe switch sum the tensors first.
      * Copy the per-switch partial sums to CPU memory and let the CPU sum them into the machine-wide result.
      * Broadcast the global sum back to all GPUs.
    * ![PCIe-only machine topology and BytePS data flow.](https://819228986-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MkzeiawY8SkBarQBDVm-659326392%2Fuploads%2F2DrChau3E1YNv1HYZIXG%2Fimage.png?alt=media\&token=8284c3b9-cb2a-4cd2-bbfd-8374ffa77a05)
  * NVLink-based topology
    * *Reduce* tensors from all GPUs to *GPU2*.
    * Copy the results to CPU memory from *GPU2*.
    * After CS gets the aggregated results from SS, *GPU2* copies the data back into GPU memory and broadcasts it to the other GPUs.
    * This keeps GPUs off *the P0 - CPU0 bandwidth* during communication, so *the NIC can run at its full 100Gbps bandwidth* (see the NVLink simulation after this list).
    * ![NVLink-based machine topology and BytePS data flow.](https://819228986-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MkzeiawY8SkBarQBDVm-659326392%2Fuploads%2Fp0IDUGSOgOGhgwtQquhl%2Fimage.png?alt=media\&token=0675ef66-dd98-4ba7-aacc-a30805399515)
  * **Two principles**
    * Avoid direct GPU-to-GPU memory copy when the two GPUs are *not* under the same PCIe switch.
    * Minimize traffic on the PCIe switch to the CPU link that is *shared by GPUs and NIC*.
* GPUDirect RDMA (GDR)
  * Limitation
    * GDR *requires the GPU and the RDMA NIC to be on the same PCIe switch*; otherwise, the throughput can be less than 50Gbps even with a 100GbE NIC.
    * The *PCIe-only topology* cannot meet this requirement, while the *NVLink-based topology* has already avoided the PCIe bottleneck, so GDR would bring little benefit there.
    * Most clouds like AWS do not support GDR.
  * Result: BytePS *does not use GDR for now*.
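
The 4MB partitioning above is what lets the summation workload be spread across multiple SS instances and pipelined with computation. Below is a minimal chunking sketch; only the 4MB bound comes from the notes, while `partition_tensor` and `MAX_CHUNK_BYTES` are hypothetical names, not BytePS's actual partitioner.

```python
# Illustrative sketch: split a gradient tensor into chunks of at most 4 MB.
import numpy as np

MAX_CHUNK_BYTES = 4 * 1024 * 1024

def partition_tensor(tensor: np.ndarray, max_bytes: int = MAX_CHUNK_BYTES):
    """Flatten the tensor and cut it into chunks of at most `max_bytes`."""
    flat = tensor.reshape(-1)
    elems_per_chunk = max(1, max_bytes // flat.itemsize)
    return [flat[i:i + elems_per_chunk]
            for i in range(0, flat.size, elems_per_chunk)]

grad = np.zeros(3 * 1024 * 1024, dtype=np.float32)   # a 12 MB fp32 tensor
chunks = partition_tensor(grad)
assert len(chunks) == 3 and all(c.nbytes <= MAX_CHUNK_BYTES for c in chunks)
```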
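
To make the *CPU-assisted aggregation* steps concrete, here is a small NumPy simulation of the data flow on a PCIe-only machine with two PCIe switches of four GPUs each. Plain arrays stand in for GPU buffers and `cs_push_pull` is a hypothetical stub for the CS push/pull to SS, so this mirrors only the data movement, not real CUDA copies or shared memory.

```python
# Runnable NumPy simulation of CPU-assisted aggregation on a PCIe-only machine.
import numpy as np

def cpu_assisted_aggregate(switch0_grads, switch1_grads, cs_push_pull):
    # Step 1: GPUs under the same PCIe switch sum their tensors locally,
    # avoiding GPU-to-GPU copies across switches.
    partial0 = np.sum(switch0_grads, axis=0)
    partial1 = np.sum(switch1_grads, axis=0)
    # Step 2: each partial sum crosses its switch-to-CPU link once;
    # the CPU adds the two to get the machine-local sum.
    local_sum = partial0 + partial1
    # Step 3: CS pushes the local sum to SS and pulls back the sum across
    # all machines (stubbed out by `cs_push_pull` here).
    global_sum = cs_push_pull(local_sum)
    # Step 4: broadcast the global sum back to every GPU.
    return [global_sum.copy() for _ in switch0_grads + switch1_grads]

grads = [np.full(4, i, dtype=np.float32) for i in range(8)]   # 8 "GPUs"
out = cpu_assisted_aggregate(grads[:4], grads[4:], cs_push_pull=lambda x: x)
assert np.allclose(out[0], sum(range(8)))   # every GPU ends with the same sum
```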
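
And a matching simulation of the NVLink-based flow, in which all tensors are first reduced onto one GPU (GPU2 in the figure) over NVLink so that no GPU traffic shares the P0 - CPU0 PCIe link with the NIC. Again, the names and the `cs_push_pull` stub are illustrative assumptions, not BytePS's API.

```python
# Runnable NumPy simulation of the NVLink-based aggregation data flow.
import numpy as np

def nvlink_aggregate(gpu_grads, cs_push_pull):
    # Step 1: reduce tensors from all GPUs onto GPU2 via NVLink.
    reduced_on_gpu2 = np.sum(gpu_grads, axis=0)
    # Step 2: only GPU2 copies the reduced tensor into CPU memory, so no
    # GPU traffic competes with the NIC on the P0 - CPU0 PCIe link.
    host_buffer = reduced_on_gpu2.copy()
    # Step 3: CS pushes the local sum to SS and pulls the global sum back.
    global_sum = cs_push_pull(host_buffer)
    # Step 4: GPU2 copies the result back to GPU memory and broadcasts it
    # to the other GPUs over NVLink.
    return [global_sum.copy() for _ in gpu_grads]

gpus = [np.ones(4, dtype=np.float32) for _ in range(8)]
out = nvlink_aggregate(gpus, cs_push_pull=lambda x: x)   # identity SS stub
assert np.allclose(out[0], 8.0)
```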

### Notes

> The CPU machines may not necessarily be actual CPU-only machines. For example, our in-house cluster scheduler can *allocate CPUs on the GPU machines* that run non-distributed jobs and have spare CPU cores and network bandwidth. This improves the overall cluster resource utilization.

> The problem is that the topology is not symmetric considering the NIC, which is connected to only one (out of four) PCIe switch.
