Related papers: Adaptive Braking for Mitigating Gradient Delay

ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters

As the size of models and datasets grows, it has become increasingly common to train models in parallel. However, existing distributed stochastic gradient descent (SGD) algorithms suffer from insufficient utilization of computational…

Machine Learning · Computer Science 2023-08-30 Xin Zhou , Ling Chen , Houming Wu

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Stochastic optimization plays a crucial role in the advancement of deep learning technologies. Over the decades, significant effort has been dedicated to improving the training efficiency and robustness of deep neural networks, via various…

Machine Learning · Computer Science 2024-08-21 Huixiu Jiang , Ling Yang , Yu Bao , Rutong Si , Sikun Yang

Adaptive Gradient Method with Resilience and Momentum

Several variants of stochastic gradient descent (SGD) have been proposed to improve the learning effectiveness and efficiency when training deep neural networks, among which some recent influential attempts would like to adaptively control…

Machine Learning · Computer Science 2020-10-22 Jie Liu , Chen Lin , Chuming Li , Lu Sheng , Ming Sun , Junjie Yan , Wanli Ouyang

Dynamic Batch Adaptation

Current deep learning adaptive optimizer methods adjust the step magnitude of parameter updates by altering the effective learning rate used by each parameter. Motivated by the known inverse relation between batch size and learning rate on…

Machine Learning · Computer Science 2022-08-02 Cristian Simionescu , George Stoica , Robert Herscovici

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?

Background: Recent developments have made it possible to accelerate neural networks training significantly using large batch sizes and data parallelism. Training in an asynchronous fashion, where delay occurs, can make training even more…

Machine Learning · Computer Science 2020-02-14 Niv Giladi , Mor Shpigel Nacson , Elad Hoffer , Daniel Soudry

Bringing Order to Asynchronous SGD: Towards Optimality under Data-Dependent Delays with Momentum

Asynchronous stochastic gradient descent (SGD) enables scalable distributed training but suffers from gradient staleness. Existing mitigation strategies, such as delay-adaptive learning rates and staleness-aware filtering, typically…

Machine Learning · Computer Science 2026-05-15 Tehila Dahan , Roie Reshef , Sharon Goldstein , Kfir Y. Levy

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD)…

Machine Learning · Computer Science 2020-06-24 Jinghui Chen , Dongruo Zhou , Yiqi Tang , Ziyan Yang , Yuan Cao , Quanquan Gu

Adaptive Momentum and Nonlinear Damping for Neural Network Training

We propose a continuous-time scheme for large-scale optimization that introduces individual, adaptive momentum coefficients regulated by the kinetic energy of each model parameter. This approach automatically adjusts to local landscape…

Machine Learning · Computer Science 2026-02-03 Aikaterini Karoni , Rajit Rajpal , Benedict Leimkuhler , Gabriel Stoltz

Stochastic Gradient Descent with Nonlinear Conjugate Gradient-Style Adaptive Momentum

Momentum plays a crucial role in stochastic gradient-based optimization algorithms for accelerating or improving training deep neural networks (DNNs). In deep learning practice, the momentum is usually weighted by a well-calibrated…

Machine Learning · Computer Science 2020-12-04 Bao Wang , Qiang Ye

Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes

The training of deep neural networks is inherently a nonconvex optimization problem, yet standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters, often leading to unstable convergence and…

Machine Learning · Computer Science 2025-08-07 Chengcheng Yan , Jiawei Xu , Zheng Peng , Qingsong Wang

Online Learning Rate Adaptation with Hypergradient Descent

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We demonstrate the effectiveness of the method in a range of optimization problems by…

Machine Learning · Computer Science 2018-08-23 Atilim Gunes Baydin , Robert Cornish , David Martinez Rubio , Mark Schmidt , Frank Wood

DIVEBATCH: Accelerating Model Training Through Gradient-Diversity Aware Batch Size Adaptation

The goal of this paper is to accelerate the training of machine learning models, a critical challenge since the training of large-scale deep neural models can be computationally expensive. Stochastic gradient descent (SGD) and its variants…

Machine Learning · Computer Science 2025-09-22 Yuen Chen , Yian Wang , Hari Sundaram

Learning Under Delayed Feedback: Implicitly Adapting to Gradient Delays

We consider stochastic convex optimization problems, where several machines act asynchronously in parallel while sharing a common memory. We propose a robust training method for the constrained setting and derive non asymptotic convergence…

Machine Learning · Computer Science 2021-06-24 Rotem Zamir Aviv , Ido Hakimi , Assaf Schuster , Kfir Y. Levy

Accelerating Asynchronous Stochastic Gradient Descent for Neural Machine Translation

In order to extract the best possible performance from asynchronous stochastic gradient descent one must increase the mini-batch size and scale the learning rate accordingly. In order to achieve further speedup we introduce a technique that…

Computation and Language · Computer Science 2018-09-17 Nikolay Bogoychev , Marcin Junczys-Dowmunt , Kenneth Heafield , Alham Fikri Aji

Adaptive Learning Rate Clipping Stabilizes Learning

Artificial neural network training with stochastic gradient descent can be destabilized by "bad batches" with high losses. This is often problematic for training with small batch sizes, high order loss functions or unstably high learning…

Machine Learning · Computer Science 2020-05-21 Jeffrey M. Ede , Richard Beanland

AB-Training: A Communication-Efficient Approach for Distributed Low-Rank Learning

Communication bottlenecks severely hinder the scalability of distributed neural network training, particularly in high-performance computing (HPC) environments. We introduce AB-training, a novel data-parallel method that leverages low-rank…

Machine Learning · Computer Science 2024-07-02 Daniel Coquelin , Katherina Flügel , Marie Weiel , Nicholas Kiefer , Muhammed Öz , Charlotte Debus , Achim Streit , Markus Götz

Large batch size training of neural networks with adversarial training and second-order information

The most straightforward method to accelerate Stochastic Gradient Descent (SGD) computation is to distribute the randomly selected batch of inputs over multiple processors. To keep the distributed processors fully utilized requires…

Machine Learning · Computer Science 2020-01-06 Zhewei Yao , Amir Gholami , Daiyaan Arfeen , Richard Liaw , Joseph Gonzalez , Kurt Keutzer , Michael Mahoney

Balance is Essence: Accelerating Sparse Training via Adaptive Gradient Correction

Despite impressive performance, deep neural networks require significant memory and computation costs, prohibiting their application in resource-constrained scenarios. Sparse training is one of the most common techniques to reduce these…

Machine Learning · Computer Science 2023-12-06 Bowen Lei , Dongkuan Xu , Ruqi Zhang , Shuren He , Bani K. Mallick

AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations

Adaptive optimization methods have been widely used in deep learning. They scale the learning rates adaptively according to the past gradient, which has been shown to be effective to accelerate the convergence. However, they suffer from…

Machine Learning · Computer Science 2021-07-06 Hongwei Zhang , Weidong Zou , Hongbo Zhao , Qi Ming , Tijin Yan , Yuanqing Xia , Weipeng Cao

Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients

Recent work has established an empirically successful framework for adapting learning rates for stochastic gradient descent (SGD). This effectively removes all needs for tuning, while automatically reducing learning rates over time on…

Machine Learning · Computer Science 2013-03-28 Tom Schaul , Yann LeCun