Related papers: Faster Asynchronous SGD

Slow and Stale Gradients Can Win the Race

Distributed Stochastic Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness…

Machine Learning · Statistics 2020-03-25 Sanghamitra Dutta, Jianyu Wang, Gauri Joshi

Distributed Stochastic Gradient Descent with Staleness: A Stochastic Delay Differential Equation Based Framework

Distributed stochastic gradient descent (SGD) has attracted considerable recent attention due to its potential for scaling computational resources, reducing training time, and helping protect user privacy in machine learning. However, the…

Machine Learning · Computer Science 2025-02-27 Siyuan Yu, Wei Chen, H. Vincent Poor

Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Distributed Stochastic Gradient Descent (SGD) when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can…

Machine Learning · Statistics 2018-05-11 Sanghamitra Dutta, Gauri Joshi, Soumyadip Ghosh, Parijat Dube, Priya Nagpurkar

Distributed SGD Generalizes Well Under Asynchrony

The performance of fully synchronized distributed systems has faced a bottleneck due to the big data trend, under which asynchronous distributed systems are becoming a major popularity due to their powerful scalability. In this paper, we…

Machine Learning · Statistics 2019-10-01 Jayanth Regatti, Gaurav Tendolkar, Yi Zhou, Abhishek Gupta, Yingbin Liang

Staleness-aware Async-SGD for Distributed Deep Learning

Deep neural networks have been shown to achieve state-of-the-art performance in several machine learning tasks. Stochastic Gradient Descent (SGD) is the preferred optimization algorithm for training these networks and asynchronous SGD…

Machine Learning · Computer Science 2016-04-06 Wei Zhang, Suyog Gupta, Xiangru Lian, Ji Liu

Making Asynchronous Stochastic Gradient Descent Work for Transformers

Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower…

Computation and Language · Computer Science 2021-11-30 Alham Fikri Aji, Kenneth Heafield

Revisiting Gradient Staleness: Evaluating Distance Metrics for Asynchronous Federated Learning Aggregation

In asynchronous federated learning (FL), client devices send updates to a central server at varying times based on their computational speed, often using stale versions of the global model. This staleness can degrade the convergence and…

Machine Learning · Computer Science 2026-03-10 Patrick Wilhelm, Odej Kao

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient…

Machine Learning · Computer Science 2026-05-14 Ammar Mahran, Artavazd Maranjyan, Peter Richtárik

Optimization Trade-offs in Asynchronous Federated Learning: A Stochastic Networks Approach

Synchronous federated learning scales poorly due to the straggler effect. Asynchronous algorithms increase the update throughput by processing updates upon arrival, but they introduce two fundamental challenges: gradient staleness, which…

Machine Learning · Computer Science 2026-03-30 Abdelkrim Alahyane, Céline Comte, Matthieu Jonckheere

Gap Aware Mitigation of Gradient Staleness

Cloud computing is becoming increasingly popular as a platform for distributed training of deep neural networks. Synchronous stochastic gradient descent (SSGD) suffers from substantial slowdowns due to stragglers if the environment is…

Machine Learning · Computer Science 2020-02-04 Saar Barkai, Ido Hakimi, Assaf Schuster

MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) is very useful in optimization problems with high-dimensional non-convex target functions, and hence constitutes an important component of several Machine Learning and Data Analytics methods. Recently there…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-11-11 Karl Bäckström, Marina Papatriantafilou, Philippas Tsigas

Convergence Analysis of Decentralized ASGD

Over the last decades, Stochastic Gradient Descent (SGD) has been intensively studied by the Machine Learning community. Despite its versatility and excellent performance, the optimization of large models via SGD still is a time-consuming…

Machine Learning · Computer Science 2025-12-01 Mauro DL Tosi, Martin Theobald

Asynchronous Parallel Stochastic Gradient Descent - A Numeric Core for Scalable Distributed Machine Learning Algorithms

The implementation of a vast majority of machine learning (ML) algorithms boils down to solving a numerical optimization problem. In this context, Stochastic Gradient Descent (SGD) methods have long proven to provide good results, both in…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-06 Janis Keuper, Franz-Josef Pfreundt

Accumulated Gradient Normalization

This work addresses the instability in asynchronous data parallel optimization. It does so by introducing a novel distributed optimizer which is able to efficiently optimize a centralized model under communication constraints. The optimizer…

Machine Learning · Statistics 2017-10-09 Joeri Hermans, Gerasimos Spanakis, Rico Möckel

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically…

Machine Learning · Computer Science 2026-05-15 Tehila Dahan, Roie Reshef, Sharon Goldstein, Kfir Y. Levy

Speeding Up Distributed Gradient Descent by Utilizing Non-persistent Stragglers

Distributed gradient descent (DGD) is an efficient way of implementing gradient descent (GD), especially for large data sets, by dividing the computation tasks into smaller subtasks and assigning to different computing servers (CSs) to be…

Information Theory · Computer Science 2018-11-29 Emre Ozfatura, Deniz Gunduz, Sennur Ulukus

HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring

While distributed training significantly speeds up the training process of the deep neural network (DNN), the utilization of the cluster is relatively low due to the time-consuming data synchronizing between workers. To alleviate this…

Machine Learning · Computer Science 2020-12-01 Yuhao Zhou, Qing Ye, Hailun Zhang, Jiancheng Lv

Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that…

Computation and Language · Computer Science 2018-09-17 Nikolay Bogoychev, Marcin Junczys-Dowmunt, Kenneth Heafield, Alham Fikri Aji

Asynchronous Distributed Semi-Stochastic Gradient Optimization

With the recent proliferation of large-scale learning problems, there have been a lot of interest on distributed machine learning algorithms, particularly those that are based on stochastic gradient descent (SGD) and its variants. However,…

Machine Learning · Computer Science 2015-12-07 Ruiliang Zhang, Shuai Zheng, James T. Kwok

Taming Momentum in a Distributed Asynchronous Environment

Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys…

Machine Learning · Computer Science 2020-10-15 Ido Hakimi, Saar Barkai, Moshe Gabel, Assaf Schuster