Network-accelerated Distributed Machine Learning Using MLFabric

Distributed, Parallel, and Cluster Computing 2019-07-02 v1

Abstract

Existing distributed machine learning (DML) systems focus on improving the computational efficiency of distributed learning, whereas communication aspects have received less attention. Many DML systems treat the network as a black box. As a result, the performance of DML algorithms is impeded by network bottlenecks, and DML systems end up sacrificing important algorithmic and system-level benefits. We present MLfabric, a communication library that manages all network transfers in a DML system and holistically determines the communication pattern of a DML algorithm at any point in time. This allows MLfabric to carefully order transfers (i.e., gradient updates) to improve convergence, opportunistically aggregate updates in-network to improve efficiency, and proactively replicate some of them to support new notions of fault tolerance. We empirically find that MLfabric achieves up to a 3X speed-up in training large deep learning models in realistic dynamic cluster settings.

Cite

@article{arxiv.1907.00434,
  title  = {Network-accelerated Distributed Machine Learning Using MLFabric},
  author = {Raajay Viswanathan and Aditya Akella},
  journal= {arXiv preprint arXiv:1907.00434},
  year   = {2019}
}