Related papers: Delay-adaptive step-sizes for asynchronous learnin…

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

Background: Recent developments have made it possible to accelerate neural network training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more…

Machine Learning · Computer Science 2020-02-14 Niv Giladi, Mor Shpigel Nacson, Elad Hoffer, Daniel Soudry

On the Convergence of Asynchronous Parallel Iteration with Unbounded Delays

Recent years have witnessed the surge of asynchronous parallel (async-parallel) iterative algorithms due to problems involving very large-scale data and a large number of decision variables. Because of asynchrony, the iterates are computed…

Optimization and Control · Mathematics 2021-02-05 Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin

Distributed Delayed Stochastic Optimization

We analyze the convergence of gradient-based optimization algorithms that base their updates on delayed stochastic gradient information. The main application of our results is to the development of gradient-based distributed optimization…

Optimization and Control · Mathematics 2011-05-02 Alekh Agarwal, John C. Duchi

Learning Under Delayed Feedback: Implicitly Adapting to Gradient Delays

We consider stochastic convex optimization problems, where several machines act asynchronously in parallel while sharing a common memory. We propose a robust training method for the constrained setting and derive non-asymptotic convergence…

Machine Learning · Computer Science 2021-06-24 Rotem Zamir Aviv, Ido Hakimi, Assaf Schuster, Kfir Y. Levy

AdaS: Adaptive Scheduling of Stochastic Gradients

The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to…

Machine Learning · Computer Science 2020-06-12 Mahdi S. Hosseini, Konstantinos N. Plataniotis

Delay-agnostic Asynchronous Distributed Optimization

Existing asynchronous distributed optimization algorithms often use diminishing step-sizes that cause slow practical convergence, or fixed step-sizes that depend on an assumed upper bound of delays. Not only is such a delay bound hard to…

Optimization and Control · Mathematics 2023-08-24 Xuyang Wu, Changxin Liu, Sindri Magnusson, Mikael Johansson

On Unbounded Delays in Asynchronous Parallel Fixed-Point Algorithms

The need for scalable numerical solutions has motivated the development of asynchronous parallel algorithms, where a set of nodes run in parallel with little or no synchronization, thus computing with delayed information. This paper studies…

Optimization and Control · Mathematics 2017-08-18 Robert Hannah, Wotao Yin

Speed learning on the fly

The practical performance of online stochastic gradient descent algorithms is highly dependent on the chosen step size, which must be tediously hand-tuned in many applications. The same is true for more advanced variants of stochastic…

Optimization and Control · Mathematics 2015-11-10 Pierre-Yves Massé, Yann Ollivier

Optimal Linear Decay Learning Rate Schedules and Further Refinements

Learning rate schedules used in practice bear little resemblance to those recommended by theory. We close much of this theory/practice gap, and as a consequence are able to derive new problem-adaptive learning rate schedules. Our main…

Machine Learning · Computer Science 2024-10-31 Aaron Defazio, Ashok Cutkosky, Harsh Mehta, Konstantin Mishchenko

On the Convergence of Step Decay Step-Size for Stochastic Optimization

The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training. Step decay step-size schedules (constant and then cut) are widely used in practice…

Optimization and Control · Mathematics 2021-02-19 Xiaoyu Wang, Sindri Magnússon, Mikael Johansson

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically…

Machine Learning · Computer Science 2026-05-15 Tehila Dahan, Roie Reshef, Sharon Goldstein, Kfir Y. Levy

Asymptotic Convergence in Online Learning with Unbounded Delays

We study the problem of predicting the results of computations that are too expensive to run, via the observation of the results of smaller computations. We model this as an online learning problem with delayed feedback, where the length of…

Machine Learning · Computer Science 2016-09-08 Scott Garrabrant, Nate Soares, Jessica Taylor

Learning-Based Sensor Scheduling for Delay-Aware and Stable Remote State Estimation

Unpredictable sensor-to-estimator delays fundamentally distort what matters for wireless remote state estimation: not just freshness, but how delay interacts with sensor informativeness and energy efficiency. In this paper, we present a…

Information Theory · Computer Science 2026-01-30 Nho-Duc Tran, Aamir Mahmood, Mikael Gidlund

Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness. Especially, it has been…

Machine Learning · Computer Science 2021-03-04 Sebastian U. Stich, Amirkeivan Mohtashami, Martin Jaggi

Learn one size to infer all: Exploiting translational symmetries in delay-dynamical and spatio-temporal systems using scalable neural networks

We design scalable neural networks adapted to translational symmetries in dynamical systems, capable of inferring untrained high-dimensional dynamics for different system sizes. We train these networks to predict the dynamics of…

Machine Learning · Computer Science 2024-07-08 Mirko Goldmann, Claudio R. Mirasso, Ingo Fischer, Miguel C. Soriano

On the Positive Effect of Delay on the Rate of Convergence of a Class of Linear Time-Delayed Systems

This paper is a comprehensive study of a long-observed phenomenon of increase in the stability margin, and so the rate of convergence, of a class of linear systems due to time delay. We use the Lambert W function to determine (a) in what systems…

Multiagent Systems · Computer Science 2019-07-23 Hossein Moradian, Solmaz S. Kia

Starting Small -- Learning with Adaptive Sample Sizes

For many machine learning problems, data is abundant and it may be prohibitive to make multiple passes through the full training set. In this context, we investigate strategies for dynamically increasing the effective sample size, when…

Machine Learning · Computer Science 2016-10-10 Hadi Daneshmand, Aurelien Lucchi, Thomas Hofmann

Convex and Non-convex Federated Learning with Stale Stochastic Gradients: Diminishing Step Size is All You Need

We propose a general framework for distributed stochastic optimization under delayed gradient models. In this setting, $n$ local agents leverage their own data and computation to assist a central server in minimizing a global objective…

Optimization and Control · Mathematics 2026-03-04 Xinran Zheng, Tara Javidi, Behrouz Touri

Optimal convergence rates of totally asynchronous optimization

Asynchronous optimization algorithms are at the core of modern machine learning and resource allocation systems. However, most convergence results consider bounded information delays and several important algorithms lack guarantees when…

Optimization and Control · Mathematics 2022-03-10 Xuyang Wu, Sindri Magnusson, Hamid Reza Feyzmahdavian, Mikael Johansson

On the Convergence of Federated Learning Algorithms without Data Similarity

Data similarity assumptions have traditionally been relied upon to understand the convergence behaviors of federated learning methods. Unfortunately, this approach often demands fine-tuning step sizes based on the level of data similarity.…

Machine Learning · Computer Science 2025-01-14 Ali Beikmohammadi, Sarit Khirirat, Sindri Magnússon