Related papers: A Highly Efficient Distributed Deep Learning Syste…

Distributed Deep Learning Strategies For Automatic Speech Recognition

In this paper, we propose and investigate a variety of distributed deep learning strategies for automatic speech recognition (ASR) and evaluate them with a state-of-the-art Long short-term memory (LSTM) acoustic model on the 2000-hour…

Sound · Computer Science 2019-04-11 Wei Zhang, Xiaodong Cui, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung, Michael Picheny

Asynchronous Decentralized Distributed Training of Acoustic Models

Large-scale distributed training of deep acoustic models plays an important role in today's high-performance automatic speech recognition (ASR). In this paper we investigate a variety of asynchronous decentralized distributed training…

Computation and Language · Computer Science 2021-10-22 Xiaodong Cui, Wei Zhang, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, David Kung

Improving Efficiency in Large-Scale Decentralized Distributed Training

Decentralized Parallel SGD (D-PSGD) and its asynchronous variant, Asynchronous Decentralized Parallel SGD (AD-PSGD), are a family of distributed learning algorithms that have been demonstrated to perform well for large-scale deep learning tasks. One…

Machine Learning · Computer Science 2020-02-05 Wei Zhang, Xiaodong Cui, Abdullah Kayi, Mingrui Liu, Ulrich Finkler, Brian Kingsbury, George Saon, Youssef Mroueh, Alper Buyuktosunoglu, Payel Das, David Kung, Michael Picheny

Distributed Training of Deep Neural Network Acoustic Models for Automatic Speech Recognition

The past decade has witnessed great progress in Automatic Speech Recognition (ASR) due to advances in deep learning. The improvements in performance can be attributed to both improved models and large-scale training data. Key to training…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-02-26 Xiaodong Cui, Wei Zhang, Ulrich Finkler, George Saon, Michael Picheny, David Kung

Loss Landscape Dependent Self-Adjusting Learning Rates in Decentralized Stochastic Gradient Descent

Distributed Deep Learning (DDL) is essential for large-scale Deep Learning (DL) training. Synchronous Stochastic Gradient Descent (SSGD) is the de facto DDL optimization method. Using a sufficiently large batch size is critical to…

Machine Learning · Computer Science 2021-12-03 Wei Zhang, Mingrui Liu, Yu Feng, Xiaodong Cui, Brian Kingsbury, Yuhai Tu

Asynchronous Decentralized Parallel Stochastic Gradient Descent

Most commonly used distributed machine learning systems are either synchronous or centralized asynchronous. Synchronous algorithms like AllReduce-SGD perform poorly in a heterogeneous environment, while asynchronous algorithms using a…

Optimization and Control · Mathematics 2018-09-26 Xiangru Lian, Wei Zhang, Ce Zhang, Ji Liu

Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning

Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational costs, there is a growing interest in accelerating SGD on HPC resources like GPU clusters. However, the…

Machine Learning · Computer Science 2021-01-20 Peng Jiang, Gagan Agrawal

Asynchronous Stochastic Gradient Descent with Decoupled Backpropagation and Layer-Wise Updates

The increasing size of deep learning models has made distributed training across multiple devices essential. However, current methods such as distributed data-parallel training suffer from large communication and synchronization overheads…

Machine Learning · Computer Science 2025-02-10 Cabrel Teguemne Fokam, Khaleelulla Khan Nazeer, Lukas König, David Kappel, Anand Subramoney

OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training

The training of modern deep learning neural networks calls for large amounts of computation, which is often provided by GPUs or other specific accelerators. To scale out to achieve faster training speed, two update algorithms are mainly…

Machine Learning · Computer Science 2020-05-15 Yemao Xu, Dezun Dong, Weixia Xu, Xiangke Liao

How to scale distributed deep learning?

Training time on large datasets for deep neural networks is the principal workflow bottleneck in a number of important applications of deep learning, such as object classification and detection in automatic driver assistance systems (ADAS).…

Machine Learning · Computer Science 2016-11-15 Peter H. Jin, Qiaochu Yuan, Forrest Iandola, Kurt Keutzer

Staleness-aware Async-SGD for Distributed Deep Learning

Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks and asynchronous SGD…

Machine Learning · Computer Science 2016-04-06 Wei Zhang, Suyog Gupta, Xiangru Lian, Ji Liu

A(DP)$^2$SGD: Asynchronous Decentralized Parallel Stochastic Gradient Descent with Differential Privacy

As deep learning models are usually massive and complex, distributed learning is essential for increasing training efficiency. Moreover, in many real-world application scenarios like healthcare, distributed learning can also keep the data…

Machine Learning · Computer Science 2020-08-24 Jie Xu, Wei Zhang, Fei Wang

Distributed stochastic optimization for deep learning (thesis)

We study the problem of how to distribute the training of large-scale deep learning models in the parallel computing environment. We propose a new distributed stochastic optimization method called Elastic Averaging SGD (EASGD). We analyze…

Machine Learning · Computer Science 2016-05-10 Sixin Zhang

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although using larger mini-batch sizes can improve the system scalability by reducing the…

Machine Learning · Computer Science 2018-07-31 Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, Xiaowen Chu

Layered SGD: A Decentralized and Synchronous SGD Algorithm for Scalable Deep Neural Network Training

Stochastic Gradient Descent (SGD) is the most popular algorithm for training deep neural networks (DNNs). As larger networks and datasets cause longer training times, training on distributed systems is common and distributed SGD variants,…

Machine Learning · Computer Science 2019-06-17 Kwangmin Yu, Thomas Flynn, Shinjae Yoo, Nicholas D'Imperio

Locally Asynchronous Stochastic Gradient Descent for Decentralised Deep Learning

Distributed training algorithms of deep neural networks show impressive convergence speedup properties on very large problems. However, they inherently suffer from communication related slowdowns and communication topology becomes a crucial…

Machine Learning · Computer Science 2022-03-25 Tomer Avidor, Nadav Tal Israel

DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging

The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. Minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back…

Distributed, Parallel, and Cluster Computing · Computer Science 2020-06-02 Qinggang Zhou, Yawen Zhang, Pengcheng Li, Xiaoyong Liu, Jun Yang, Runsheng Wang, Ru Huang

An Effective Training Framework for Light-Weight Automatic Speech Recognition Models

Recent advancement in deep learning encouraged developing large automatic speech recognition (ASR) models that achieve promising results while ignoring computational and memory constraints. However, deploying such models on low resource…

Computer Vision and Pattern Recognition · Computer Science 2025-05-29 Abdul Hannan, Alessio Brutti, Shah Nawaz, Mubashir Noman

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

We design and implement a distributed multinode synchronous SGD algorithm, without altering hyperparameters, compressing data, or altering algorithmic behavior. We perform a detailed analysis of scaling, and identify optimal design…

Distributed, Parallel, and Cluster Computing · Computer Science 2016-03-02 Dipankar Das, Sasikanth Avancha, Dheevatsa Mudigere, Karthikeyan Vaidynathan, Srinivas Sridharan, Dhiraj Kalamkar, Bharat Kaul, Pradeep Dubey

AdaScale SGD: A User-Friendly Algorithm for Distributed Training

When using large-batch training to speed up stochastic gradient descent, learning rates must adapt to new batch sizes in order to maximize speed-ups and preserve model quality. Re-tuning learning rates is resource intensive, while fixed…

Machine Learning · Computer Science 2020-07-13 Tyler B. Johnson, Pulkit Agrawal, Haijie Gu, Carlos Guestrin