# Elastic Parameter Server Load Distribution in Deep Learning Clusters

## Metadata

Presented in [SoCC 2020](https://dl.acm.org/doi/10.1145/3419111.3421307).

Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo

## Understanding the paper

A joint work between HKU and ByteDance.

### Motivation

* Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by **stragglers** arising from causes such as *imbalanced parameter distribution*, *bandwidth contention*, or *computation interference*.
* **Few** studies have investigated **efficient parameter (aka load) distribution** among parameter servers (PS).

### Solution

* Propose a **dynamic** parameter server load distribution scheme called **PSLD**.
  * Mitigate PS straggler issues and accelerate distributed model training.
  * An *exploitation-exploration* method is used to 1) scale parameter servers in and out, and 2) adjust the parameter distribution among PSs (a rough sketch follows the figure below).
  * Implemented on BytePS and vanilla MXNet PS architectures.

![The workflow of PSLD](/files/3wM02fpvukolyWzUmTBT)
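
To make the exploitation-exploration idea concrete, here is a minimal sketch of what one rebalance-or-scale step could look like. This is *not* the paper's actual algorithm (the details were not read); it assumes a simple epsilon-greedy exploration over the PS count, a greedy largest-first assignment of parameters to the least-loaded PS, and a straggler check based on per-PS communication times. All names (`assign_parameters`, `psld_step`, the 1.2x straggler threshold) are hypothetical.

```python
import random

def assign_parameters(param_sizes, ps_ids):
    """Greedy balanced partition: place each tensor (largest first)
    on the PS that currently holds the fewest bytes."""
    load = {ps: 0 for ps in ps_ids}
    placement = {}
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        target = min(load, key=load.get)
        placement[name] = target
        load[target] += size
    return placement

def psld_step(param_sizes, ps_ids, ps_comm_times, epsilon=0.1, max_servers=8):
    """One exploitation-exploration step (hypothetical sketch).

    Exploration: with small probability, scale the PS set out or in and
    observe whether later iterations get faster.
    Exploitation: if one PS is a straggler (its per-iteration communication
    time is well above the mean), re-distribute parameters across the PSs.
    """
    if random.random() < epsilon:
        # Exploration: perturb the number of parameter servers.
        if len(ps_ids) < max_servers and random.random() < 0.5:
            ps_ids = ps_ids + ["ps-%d" % len(ps_ids)]   # scale out
        elif len(ps_ids) > 1:
            ps_ids = ps_ids[:-1]                        # scale in
        return ps_ids, assign_parameters(param_sizes, ps_ids)

    mean_t = sum(ps_comm_times.values()) / len(ps_comm_times)
    if max(ps_comm_times.values()) > 1.2 * mean_t:
        # Exploitation: a straggler was detected, rebalance the load.
        return ps_ids, assign_parameters(param_sizes, ps_ids)
    return ps_ids, None  # keep the current placement
```

In a real system such as BytePS or the MXNet KVStore, the per-PS communication times would come from instrumentation of push/pull operations, and re-assignment would require migrating parameter shards between servers; the sketch only shows the decision logic.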

I have not read the details of the algorithms.

