# A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters

## Meta Info

Presented in [OSDI 2020](https://www.usenix.org/conference/osdi20/presentation/jiang).

Authors: Yimin Jiang (*THU & ByteDance*), Yibo Zhu (*ByteDance*), Chang Lan (*Google*), Bairen Yi (*ByteDance*), Yong Cui (*THU*), Chuanxiong Guo (*ByteDance*).

Code: <https://github.com/bytedance/byteps>

## Understanding the paper

This paper presents a distributed DNN training architecture called **BytePS**.

It leverages *spare CPU and bandwidth resources* in the cluster to accelerate distributed DNN training tasks.

It outperforms state-of-the-art open-source *all-reduce* and *PS* implementations.

### Observation

* There are *spare CPUs and bandwidth* in production GPU clusters.
* Existing *all-reduce* and *PS* architectures are sub-optimal in such clusters.
  * Neither utilizes the spare CPU and bandwidth resources well.

### Architecture

* Two main modules
  * *Summation Service (SS)*
    * Receive tensors from CS; sum them up; send them back to CS.
  * *Communication Service (CS)*
    * (Internally) synchronize the tensors among the local GPUs
    * (Externally) communicate with SS

<figure><img src="/files/7MaJ69YBbSJLnzenL7cL" alt=""><figcaption><p>BytePS architecture.</p></figcaption></figure>
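The division of labor above can be sketched in a few lines. This is a minimal NumPy sketch, not the actual BytePS implementation; the function name `summation_service` and the list-of-partitions interface are illustrative assumptions.

```python
import numpy as np

def summation_service(partitions):
    """Illustrative sketch of SS (not the real BytePS API):
    receive same-shaped tensor partitions pushed by the CS of
    each GPU machine, sum them elementwise, and return the
    result so each CS can broadcast it to its local GPUs."""
    total = np.zeros_like(partitions[0])
    for p in partitions:
        total += p  # summation is the only compute SS performs
    return total

# Three "machines" each push one partition of the same shape.
parts = [np.full(4, i, dtype=np.float32) for i in range(3)]
print(summation_service(parts))  # → [3. 3. 3. 3.]
```

The point of keeping SS this simple is that summation is cheap enough to run efficiently on CPUs, while the optimizer update stays on the GPU side.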

### Communication design

* Inter-machine communication
  * The summation workload assigned to each SS determines its *network traffic*, so the workload is balanced across machines.
  * Partition the tensors into small parts no larger than 4 MB.
  * Use RDMA RoCEv2.
  * Each machine has *one 100GbE NIC*.
* Intra-machine communication
  * PCIe-only topology
    * *CPU-assisted aggregation*
      * Let GPUs under the same PCIe switch sum the tensors.
      * Copy to CPU and let CPU do the global summation.
      * Broadcast back the global sum.
    * ![PCIe-only machine topology and BytePS data flow.](/files/UKlZCRVy8qoqVPTweZVP)
  * NVLink-based topology
    * *Reduce* tensors from all GPUs to *GPU2*.
    * Copy the results to CPU memory from *GPU2*.
    * After CS gets the aggregated results from SS, *GPU2* copies the data back into GPU memory and broadcasts it to the other GPUs.
    * This keeps GPUs from using *the P0 - CPU0 bandwidth* for communication, so *the NIC can run at the full 100Gbps bandwidth*.
    * ![NVLink-based machine topology and BytePS data flow.](/files/PPdwQw9p6Y0N5hx75pyP)
  * **Two principles**
    * Avoid direct GPU-to-GPU memory copy when the two GPUs are *not* under the same PCIe switch.
    * Minimize traffic on the PCIe-switch-to-CPU link that is *shared by the GPUs and the NIC*.
* GPUDirect RDMA (GDR)
  * Limitation
    * GDR *requires the GPU and the RDMA NIC to be on the same PCIe switch*; otherwise, the throughput can fall below 50Gbps even with a 100GbE NIC.
    * The *PCIe-only topology* cannot meet this requirement, while the *NVLink-based topology* has already avoided any PCIe bottleneck, so GDR brings little benefit there.
    * Most clouds like AWS do not support GDR.
  * Result: BytePS *does not use GDR for now*.
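The partitioning step above can be illustrated with a short sketch. This is a hedged NumPy example, not BytePS code; `partition` and `PART_BYTES` are made-up names, and only the 4 MB bound comes from the notes.

```python
import numpy as np

PART_BYTES = 4 * 1024 * 1024  # 4 MB partition bound from the paper

def partition(tensor, part_bytes=PART_BYTES):
    """Illustrative sketch: split a tensor into flat partitions of
    at most part_bytes each, so they can be spread across all SS
    instances to balance the summation (and thus network) load."""
    flat = tensor.reshape(-1)
    elems = max(1, part_bytes // flat.itemsize)
    return [flat[i:i + elems] for i in range(0, flat.size, elems)]

t = np.zeros(3 * 1024 * 1024, dtype=np.float32)  # a 12 MB tensor
parts = partition(t)
print(len(parts), parts[0].nbytes)  # 3 partitions of 4 MB each
```

Small, uniform partitions are what let every SS instance receive a comparable share of traffic regardless of individual tensor sizes.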
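The CPU-assisted aggregation data flow for the PCIe-only topology can also be sketched end to end. Again a hedged toy model in NumPy: the grouping-by-switch structure is from the notes, but the function and its interface are assumptions, and real BytePS overlaps these steps with GPU copies rather than running them sequentially.

```python
import numpy as np

def cpu_assisted_aggregation(switch_groups):
    """Toy model of CPU-assisted aggregation on a PCIe-only machine:
    switch_groups is a list of PCIe-switch groups, each holding the
    gradient tensors of the GPUs under that switch."""
    # Step 1: GPUs under the same PCIe switch sum among themselves.
    partials = [sum(gpu_tensors) for gpu_tensors in switch_groups]
    # Step 2: each group copies one partial sum to CPU memory,
    # and the CPU performs the global summation.
    global_sum = sum(partials)
    # Step 3: broadcast the global sum back to every GPU.
    return [global_sum.copy() for group in switch_groups for _ in group]

# Two PCIe switches, two GPUs each, all holding gradient "ones".
groups = [[np.ones(2), np.ones(2)], [np.ones(2), np.ones(2)]]
out = cpu_assisted_aggregation(groups)
print(out[0])  # → [4. 4.]
```

The payoff is that only one copy per switch crosses the contended PCIe-switch-to-CPU link, instead of one copy per GPU.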

### Notes

> The CPU machines may not necessarily be actual CPU-only machines. For example, our in-house cluster scheduler can *allocate CPUs on the GPU machines* that run non-distributed jobs and have spare CPU cores and network bandwidth. This improves the overall cluster resource utilization.

> The problem is that the topology is not symmetric considering the NIC, which is connected to only one (out of four) PCIe switch.


