Related papers: How to scale distributed deep learning?

Layered SGD: A Decentralized and Synchronous SGD Algorithm for Scalable Deep Neural Network Training

Stochastic Gradient Descent (SGD) is the most popular algorithm for training deep neural networks (DNNs). As larger networks and datasets cause longer training times, training on distributed systems is common and distributed SGD variants,…

Machine Learning · Computer Science 2019-06-17 Kwangmin Yu , Thomas Flynn , Shinjae Yoo , Nicholas D'Imperio

Gossip training for deep learning

We address the issue of speeding up the training of convolutional networks. Here we study a distributed method adapted to stochastic gradient descent (SGD). The parallel optimization setup uses several threads, each applying individual…

Computer Vision and Pattern Recognition · Computer Science 2016-11-30 Michael Blot , David Picard , Matthieu Cord , Nicolas Thome

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

We design and implement a distributed multinode synchronous SGD algorithm, without altering hyper parameters, or compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-02 Dipankar Das , Sasikanth Avancha , Dheevatsa Mudigere , Karthikeyan Vaidynathan , Srinivas Sridharan , Dhiraj Kalamkar , Bharat Kaul , Pradeep Dubey

GoSGD: Distributed Optimization for Deep Learning with Gossip Exchange

We address the issue of speeding up the training of convolutional neural networks by studying a distributed method adapted to stochastic gradient descent. Our parallel optimization setup uses several threads, each applying individual…

Machine Learning · Computer Science 2018-11-13 Michael Blot , David Picard , Matthieu Cord

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

This paper presents a theoretical analysis and practical evaluation of the main bottlenecks towards a scalable distributed solution for the training of Deep Neuronal Networks (DNNs). The presented results show, that the current state of the…

Computer Vision and Pattern Recognition · Computer Science 2016-12-06 Janis Keuper , Franz-Josef Pfreundt

Sparse Communication for Training Deep Networks

Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models. In this algorithm, each worker shares its local gradients with others and updates the parameters using the…

Machine Learning · Computer Science 2020-09-22 Negar Foroutan Eghlidi , Martin Jaggi

OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training

The training of modern deep learning neural network calls for large amounts of computation, which is often provided by GPUs or other specific accelerators. To scale out to achieve faster training speed, two update algorithms are mainly…

Machine Learning · Computer Science 2020-05-15 Yemao Xu , Dezun Dong , Weixia Xu , Xiangke Liao

HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring

While distributed training significantly speeds up the training process of the deep neural network (DNN), the utilization of the cluster is relatively low due to the time-consuming data synchronizing between workers. To alleviate this…

Machine Learning · Computer Science 2020-12-01 Yuhao Zhou , Qing Ye , Hailun Zhang , Jiancheng Lv

Asynchronous Decentralized Distributed Training of Acoustic Models

Large-scale distributed training of deep acoustic models plays an important role in today's high-performance automatic speech recognition (ASR). In this paper we investigate a variety of asynchronous decentralized distributed training…

Computation and Language · Computer Science 2021-10-22 Xiaodong Cui , Wei Zhang , Abdullah Kayi , Mingrui Liu , Ulrich Finkler , Brian Kingsbury , George Saon , David Kung

Distributed SGD Generalizes Well Under Asynchrony

The performance of fully synchronized distributed systems has faced a bottleneck due to the big data trend, under which asynchronous distributed systems are becoming a major popularity due to their powerful scalability. In this paper, we…

Machine Learning · Statistics 2019-10-01 Jayanth Regatti , Gaurav Tendolkar , Yi Zhou , Abhishek Gupta , Yingbin Liang

Scaling up Stochastic Gradient Descent for Non-convex Optimisation

Stochastic gradient descent (SGD) is a widely adopted iterative method for optimizing differentiable objective functions. In this paper, we propose and discuss a novel approach to scale up SGD in applications involving non-convex functions…

Machine Learning · Statistics 2022-10-07 Saad Mohamad , Hamad Alamri , Abdelhamid Bouchachia

Locally Asynchronous Stochastic Gradient Descent for Decentralised Deep Learning

Distributed training algorithms of deep neural networks show impressive convergence speedup properties on very large problems. However, they inherently suffer from communication related slowdowns and communication topology becomes a crucial…

Machine Learning · Computer Science 2022-03-25 Tomer Avidor , Nadav Tal Israel

A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning

With huge amounts of training data, deep learning has made great breakthroughs in many artificial intelligence (AI) applications. However, such large-scale data sets present computational challenges, requiring training to be distributed on…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-11-01 Shaohuai Shi , Qiang Wang , Xiaowen Chu , Bo Li

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-02 Qinggang Zhou , Yawen Zhang , Pengcheng Li , Xiaoyong Liu , Jun Yang , Runsheng Wang , Ru Huang

A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

Distributed training is an effective way to accelerate the training process of large-scale deep learning models. However, the parameter exchange and synchronization of distributed stochastic gradient descent introduce a large amount of…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-08-16 LingFei Dai , Boyu Diao , Chao Li , Yongjun Xu

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads…

Machine Learning · Computer Science 2025-02-10 Cabrel Teguemne Fokam , Khaleelulla Khan Nazeer , Lukas König , David Kappel , Anand Subramoney

Distributed Stochastic Optimization via Adaptive SGD

Stochastic convex optimization algorithms are the most popular way to train machine learning models on large-scale data. Scaling up the training process of these models is crucial, but the most popular algorithm, Stochastic Gradient Descent…

Machine Learning · Statistics 2018-10-30 Ashok Cutkosky , Robert Busa-Fekete

Distributed stochastic optimization for deep learning (thesis)

We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze…

Machine Learning · Computer Science 2016-05-10 Sixin Zhang

Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability

Distributed deep learning is an effective way to reduce the training time of deep learning for large datasets as well as complex models. However, the limited scalability caused by network overheads makes it difficult to synchronize the…

Machine Learning · Computer Science 2022-10-18 Sangho Yeo , Minho Bae , Minjoong Jeong , Oh-kyoung Kwon , Sangyoon Oh

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by…

Machine Learning · Computer Science 2020-03-03 Shaohuai Shi , Zhenheng Tang , Qiang Wang , Kaiyong Zhao , Xiaowen Chu