Elastic Parameter Server Load Distribution in Deep Learning Clusters
Presented at SoCC 2020.
Authors: Yangrui Chen, Yanghua Peng, Yixin Bao, Chuan Wu, Yibo Zhu, Chuanxiong Guo
A collaboration between HKU and ByteDance.
Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers for several reasons (e.g., imbalanced parameter distribution, bandwidth contention, or computation interference).
Few studies have investigated efficient parameter (aka load) distribution among parameter servers (PS).
Propose a dynamic parameter server load distribution scheme called PSLD.
Mitigate PS straggler issues and accelerate distributed model training.
An exploitation-exploration method is used to 1) scale parameter servers in and out, and 2) adjust the parameter distribution among PSs; a rough sketch of this idea is given below.
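The algorithm details are not covered in these notes (see the last point), but the high-level idea of pairing an exploration-exploitation policy with PS load rebalancing could look roughly like the sketch below. Everything here is an assumption for illustration: the epsilon-greedy choice of PS pool size, the greedy bin-packing rebalancer, and all function names are mine, not PSLD's actual algorithm.

```python
# Hypothetical sketch only -- not the paper's PSLD algorithm.
import random

def rebalance(param_sizes, num_servers):
    """Greedy bin packing: assign each parameter (largest first) to the
    currently lightest-loaded server to even out PS load."""
    loads = [0.0] * num_servers
    assignment = {}
    for name, size in sorted(param_sizes.items(), key=lambda kv: -kv[1]):
        target = min(range(num_servers), key=lambda s: loads[s])
        assignment[name] = target
        loads[target] += size
    return assignment, loads

def choose_num_servers(reward_history, epsilon=0.2, min_servers=1, max_servers=8):
    """Epsilon-greedy: usually exploit the pool size with the lowest average
    observed step time, occasionally explore a different size."""
    if not reward_history or random.random() < epsilon:
        return random.randint(min_servers, max_servers)   # explore
    return min(reward_history,                             # exploit
               key=lambda k: sum(reward_history[k]) / len(reward_history[k]))

# Toy usage with made-up parameter sizes (MB).
param_sizes = {"fc1": 400.0, "fc2": 120.0, "conv1": 30.0, "conv2": 60.0}
reward_history = {}   # pool size -> observed step times
num_servers = 2
for _ in range(10):
    assignment, loads = rebalance(param_sizes, num_servers)
    # A real system would measure the training step time; here it is faked
    # as proportional to the most-loaded (straggler) PS plus noise.
    step_time = max(loads) / 100.0 + random.uniform(0.0, 0.5)
    reward_history.setdefault(num_servers, []).append(step_time)
    num_servers = choose_num_servers(reward_history)
```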
Implemented on BytePS and vanilla MXNet PS architectures.
I have not read the details of the algorithms.