# Elastic Parameter Server Load Distribution in Deep Learning Clusters

## Metadata

Presented in [SoCC 2020](https://dl.acm.org/doi/10.1145/3419111.3421307).

Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo

## Understanding the paper

A joint work between HKU and ByteDance.

### Motivation

* Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by **stragglers** caused by factors such as *imbalanced parameter distribution*, *bandwidth contention*, or *computation interference*.
* **Few** studies have investigated **efficient parameter (aka load) distribution** among parameter servers.

### Solution

* Propose a **dynamic** parameter server load distribution scheme called **PSLD**.
  * Mitigate PS straggler issues and accelerate distributed model training.
  * An *exploitation-exploration* method is used to 1) scale parameter servers in and out, and 2) adjust the parameter distribution among the PSs.
  * Implemented on BytePS and vanilla MXNet PS architectures.

![The workflow of PSLD](https://819228986-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MkzeiawY8SkBarQBDVm-659326392%2Fuploads%2Fgit-blob-b8c1fbea9fa876fa35b2131b07ee824b650b5cbe%2Fpsld-workflow.png?alt=media)
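I have not read PSLD's actual algorithm, but the core load-distribution problem it addresses (assigning parameter blocks to servers so that no PS becomes a straggler) can be illustrated with a simple greedy heuristic. Everything below is an illustrative sketch, not the paper's method; the block sizes, function names, and the largest-first (LPT) heuristic are all my own assumptions:

```python
# Illustrative sketch only -- NOT the paper's algorithm. Shows the kind of
# load-balancing decision a parameter server scheduler must make: assign
# parameter blocks to servers so per-server load is as even as possible.
import heapq

def assign_blocks(block_sizes, num_servers):
    """Greedy largest-first assignment: each block goes to the currently
    least-loaded server (longest-processing-time heuristic)."""
    # Min-heap of (current_load, server_id) so the lightest server pops first.
    heap = [(0, s) for s in range(num_servers)]
    heapq.heapify(heap)
    assignment = {}
    for block_id, size in sorted(enumerate(block_sizes), key=lambda x: -x[1]):
        load, server = heapq.heappop(heap)
        assignment[block_id] = server
        heapq.heappush(heap, (load + size, server))
    return assignment

# Hypothetical parameter-block sizes (e.g., in MB) spread over 3 servers.
blocks = [90, 70, 40, 40, 30, 20, 10]
plan = assign_blocks(blocks, 3)
loads = [sum(blocks[b] for b, s in plan.items() if s == srv) for srv in range(3)]
print(plan, loads)
```

A static assignment like this is exactly what stragglers break at runtime (bandwidth contention, interference), which is why PSLD re-adjusts the distribution dynamically instead of computing it once.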

I have not read the details of the algorithms.
