Distributed Training Large-Scale Deep Architectures

Shang-Xuan Zou; Chun-Yen Chen; Jui-Lin Wu; Chun-Nan Chou; Chia-Chin Tsao; Kuan-Chieh Tung; Ting-Wei Lin; Cheng-Lung Sung; Edward Y. Chang

Subjects: Distributed, Parallel, and Cluster Computing; Machine Learning. Date: 2017-09-21 (v1)

Abstract

The scale of data and the scale of computation infrastructure together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on a systems approach to speeding up large-scale training. Drawing on lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners configure an effective system and fine-tune parameters to achieve the desired speedup. Specifically, we develop a procedure for setting the minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components, such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.

Keywords

large language model training parallel algorithm deep learning

Cite

@article{arxiv.1709.06622,
  title  = {Distributed Training Large-Scale Deep Architectures},
  author = {Shang-Xuan Zou and Chun-Yen Chen and Jui-Lin Wu and Chun-Nan Chou and Chia-Chin Tsao and Kuan-Chieh Tung and Ting-Wei Lin and Cheng-Lung Sung and Edward Y. Chang},
  journal= {arXiv preprint arXiv:1709.06622},
  year   = {2017}
}