Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

Shijian Li; Robert J. Walls; Tian Guo

Distributed, Parallel, and Cluster Computing | 2020-04-08 | v1 | Machine Learning | Performance

Abstract

Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration (e.g., server type and number) for different training workloads while balancing the trade-offs in training time, cost, and model accuracy. Adding to the complexity is the potential to reduce the monetary cost by using cheaper, but revocable, transient GPU servers. In this work, we analyze distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework. Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers. We also demonstrate the feasibility of predicting training speed and overhead using regression-based models. Finally, we discuss potential use cases of our performance modeling, such as detecting and mitigating performance bottlenecks.

Keywords

GPU computing, cloud computing, distributed computing

Cite

@article{arxiv.2004.03072,
  title  = {Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers},
  author = {Shijian Li and Robert J. Walls and Tian Guo},
  journal= {arXiv preprint arXiv:2004.03072},
  year   = {2020}
}

Comments

11 pages, 12 figures, 5 tables; in Proceedings of the 40th IEEE International Conference on Distributed Computing Systems (ICDCS), 2020
