Elastic Parameter Server Load Distribution in Deep Learning Clusters

Metadata

Presented at SoCC 2020.

Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo

Understanding the paper

A collaboration between HKU and ByteDance.

Motivation

  • Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers for various reasons (e.g., imbalanced parameter distribution, bandwidth contention, or computation interference).

  • Few studies have investigated efficient parameter (i.e., load) distribution among PSs.

Solution

  • Propose a dynamic parameter server load distribution scheme called PSLD.

    • Mitigate PS straggler issues and accelerate distributed model training.

    • An exploitation-exploration method is used to 1) scale parameter servers in and out, and 2) adjust the parameter distribution among PSs (a rough sketch of this idea follows below).

    • Implemented on BytePS and vanilla MXNet PS architectures.

I have not yet read the details of the algorithms.
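
Since I have not read the algorithm details, the following is only a rough, hypothetical sketch of what an exploitation-exploration loop for PS load rebalancing could look like; it is not the paper's PSLD algorithm. The greedy partitioner, the epsilon-greedy policy, and helpers such as `measure_step_time` are my own assumptions for illustration.

```python
"""Hypothetical sketch (not the paper's PSLD algorithm) of an
exploitation-exploration loop that scales parameter servers in/out
and re-partitions parameters among them."""
import random


def partition_params(param_sizes, num_servers):
    """Greedily assign parameter blocks to the currently least-loaded PS."""
    loads = [0.0] * num_servers
    assignment = {}
    # Place the largest blocks first so big tensors don't pile onto one server.
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        target = min(range(num_servers), key=lambda s: loads[s])
        assignment[name] = target
        loads[target] += size
    return assignment, loads


def measure_step_time(loads):
    # Stand-in for a real measurement: step time is dominated by the most
    # loaded PS, plus noise to mimic bandwidth contention / interference.
    return max(loads) * (1.0 + random.uniform(0.0, 0.3))


def rebalance_loop(param_sizes, max_servers=8, steps=50, epsilon=0.2):
    """Epsilon-greedy search over the number of PS instances (scale in/out)."""
    best_n, best_time = 1, float("inf")
    current_n = 1
    for _ in range(steps):
        if random.random() < epsilon:
            # Explore: try scaling the PS fleet in or out by one server.
            current_n = max(1, min(max_servers, current_n + random.choice([-1, 1])))
        else:
            # Exploit: keep the best configuration observed so far.
            current_n = best_n
        _, loads = partition_params(param_sizes, current_n)
        step_time = measure_step_time(loads)
        if step_time < best_time:
            best_n, best_time = current_n, step_time
    return best_n, best_time


if __name__ == "__main__":
    # Toy workload: 20 parameter blocks of varying sizes (arbitrary units).
    sizes = {f"layer{i}.weight": random.randint(1, 100) for i in range(20)}
    n, t = rebalance_loop(sizes)
    print(f"best number of parameter servers: {n}, observed step time: {t:.1f}")
```

In a real system the simulated `measure_step_time` would be replaced by profiled push/pull times from the PS runtime (e.g., BytePS or MXNet PS), and rebalancing would also move individual parameter blocks between servers rather than only changing the number of servers.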
