Parameter servers (PS) are widely used in distributed DNN training, but their performance can be degraded by stragglers arising from causes such as imbalanced parameter distribution, bandwidth contention, or computation interference.
Few studies have investigated efficient parameter (i.e., load) distribution among parameter servers.
Solution
Propose a dynamic parameter server load distribution scheme called PSLD.
Mitigate PS straggler issues and accelerate distributed model training.
An exploitation-exploration method is used to 1) scale parameter servers in and out, and 2) adjust parameter distribution among PSs (see the sketch after this list).
Implemented on BytePS and vanilla MXNet PS architectures.
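A minimal, illustrative sketch of how an exploitation-exploration loop over PS load might look, assuming an epsilon-greedy policy and hypothetical monitoring structures (PSStats, choose_action, rebalance_shards); none of these names come from the PSLD paper or the BytePS/MXNet APIs.

```python
# Hypothetical sketch of exploitation-exploration over PS rebalancing actions.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class PSStats:
    """Per-server load observed during the last training interval (illustrative)."""
    server_id: int
    bytes_assigned: int        # total size of parameters hosted on this PS
    push_pull_time_ms: float   # average push/pull latency seen by workers


def straggler_ratio(stats: List[PSStats]) -> float:
    """Ratio between the slowest and the average PS latency."""
    times = [s.push_pull_time_ms for s in stats]
    return max(times) / (sum(times) / len(times))


def choose_action(stats: List[PSStats], epsilon: float = 0.1) -> str:
    """Epsilon-greedy choice among rebalancing actions.

    With probability epsilon we explore (probe a different PS pool size);
    otherwise we exploit the current signal and move load off the slowest PS.
    """
    if random.random() < epsilon:
        return "scale_out" if straggler_ratio(stats) > 1.5 else "scale_in"
    return "rebalance_shards"


def rebalance_shards(stats: List[PSStats]) -> None:
    """Move a fraction of parameters from the slowest PS to the fastest.

    A real system would re-key parameter shards in the PS registry; here we
    only adjust the bookkeeping to show the idea.
    """
    slowest = max(stats, key=lambda s: s.push_pull_time_ms)
    fastest = min(stats, key=lambda s: s.push_pull_time_ms)
    moved = slowest.bytes_assigned // 10   # shift roughly 10% of the load
    slowest.bytes_assigned -= moved
    fastest.bytes_assigned += moved


if __name__ == "__main__":
    # Fabricated numbers purely for illustration.
    stats = [PSStats(0, 400 << 20, 12.0),
             PSStats(1, 400 << 20, 35.0),   # straggling server
             PSStats(2, 400 << 20, 13.5)]
    action = choose_action(stats)
    if action == "rebalance_shards":
        rebalance_shards(stats)
    print(action, [s.bytes_assigned >> 20 for s in stats])
```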