Related papers: Step-size Optimization for Continual Learning

A Theoretical and Empirical Study on the Convergence of Adam with an "Exact" Constant Step Size in Non-Convex Settings

In neural network training, RMSProp and Adam remain widely favoured optimisation algorithms. One of the keys to their performance lies in selecting the correct step size, which can significantly influence their effectiveness. Additionally,…

Machine Learning · Computer Science 2024-04-05 Alokendu Mazumder , Rishabh Sabharwal , Manan Tayal , Bhartendu Kumar , Punit Rathore

Meta-descent for Online, Continual Prediction

This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of…

Machine Learning · Computer Science 2019-12-16 Andrew Jacobsen , Matthew Schlegel , Cameron Linke , Thomas Degris , Adam White , Martha White

Step-size Adaptation Using Exponentiated Gradient Updates

Optimizers like Adam and AdaGrad have been very successful in training large-scale neural networks. Yet, the performance of these methods is heavily dependent on a carefully tuned learning rate schedule. We show that in many large-scale…

Machine Learning · Computer Science 2022-02-02 Ehsan Amid , Rohan Anil , Christopher Fifty , Manfred K. Warmuth

Learning Feature Relevance Through Step Size Adaptation in Temporal-Difference Learning

There is a long history of using meta learning as representation learning, specifically for determining the relevance of inputs. In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting step…

Machine Learning · Computer Science 2019-03-11 Alex Kearney , Vivek Veeriah , Jaden Travnik , Patrick M. Pilarski , Richard S. Sutton

TIDBD: Adapting Temporal-difference Step-sizes Through Stochastic Meta-descent

In this paper, we introduce a method for adapting the step-sizes of temporal difference (TD) learning. The performance of TD methods often depends on well chosen step-sizes, yet few algorithms have been developed for setting the step-size…

Machine Learning · Computer Science 2018-04-11 Alex Kearney , Vivek Veeriah , Jaden B. Travnik , Richard S. Sutton , Patrick M. Pilarski

L4: Practical loss-based stepsize adaptation for deep learning

We propose a stepsize adaptation scheme for stochastic gradient descent. It operates directly with the loss function and rescales the gradient in order to make fixed predicted progress on the loss. We demonstrate its capabilities by…

Machine Learning · Computer Science 2018-12-03 Michal Rolinek , Georg Martius

Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

Stochastic gradient descent (SGD) is the main approach for training deep networks: it moves towards the optimum of the cost function by iteratively updating the parameters of a model in the direction of the gradient of the loss evaluated on…

Machine Learning · Computer Science 2021-03-30 Loris Nanni , Gianluca Maguolo , Alessandra Lumini

Blockwise Adaptivity: Faster Training and Better Generalization in Deep Learning

Stochastic methods with coordinate-wise adaptive stepsize (such as RMSprop and Adam) have been widely used in training deep neural networks. Despite their fast convergence, they can generalize worse than stochastic gradient descent. In this…

Machine Learning · Computer Science 2019-05-27 Shuai Zheng , James T. Kwok

diffGrad: An Optimization Method for Convolutional Neural Networks

Stochastic Gradient Decent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic…

Machine Learning · Computer Science 2021-11-30 Shiv Ram Dubey , Soumendu Chakraborty , Swalpa Kumar Roy , Snehasis Mukherjee , Satish Kumar Singh , Bidyut Baran Chaudhuri

Block-Normalized Gradient Method: An Empirical Study for Training Deep Neural Network

In this paper, we propose a generic and simple strategy for utilizing stochastic gradient information in optimization. The technique essentially contains two consecutive steps in each iteration: 1) computing and normalizing each block…

Machine Learning · Computer Science 2018-04-24 Adams Wei Yu , Lei Huang , Qihang Lin , Ruslan Salakhutdinov , Jaime Carbonell

Lookahead Optimizer: k steps forward, 1 step back

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate…

Machine Learning · Computer Science 2019-12-04 Michael R. Zhang , James Lucas , Geoffrey Hinton , Jimmy Ba

On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes

Stochastic gradient descent is the method of choice for large scale optimization of machine learning objective functions. Yet, its performance is greatly variable and heavily depends on the choice of the stepsizes. This has motivated a…

Machine Learning · Statistics 2019-02-28 Xiaoyu Li , Francesco Orabona

Rethinking Adam: A Twofold Exponential Moving Average Approach

Adaptive gradient methods, e.g. \textsc{Adam}, have achieved tremendous success in machine learning. Scaling the learning rate element-wisely by a certain form of second moment estimate of gradients, such methods are able to attain rapid…

Machine Learning · Computer Science 2022-02-10 Yizhou Wang , Yue Kang , Can Qin , Huan Wang , Yi Xu , Yulun Zhang , Yun Fu

Special Properties of Gradient Descent with Large Learning Rates

When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well…

Machine Learning · Computer Science 2023-02-17 Amirkeivan Mohtashami , Martin Jaggi , Sebastian Stich

On the Convergence of Step Decay Step-Size for Stochastic Optimization

The convergence of stochastic gradient descent is highly dependent on the step-size, especially on non-convex problems such as neural network training. Step decay step-size schedules (constant and then cut) are widely used in practice…

Optimization and Control · Mathematics 2021-02-19 Xiaoyu Wang , Sindri Magnússon , Mikael Johansson

AdaS: Adaptive Scheduling of Stochastic Gradients

The choice of step-size used in Stochastic Gradient Descent (SGD) optimization is empirically selected in most training procedures. Moreover, the use of scheduled learning techniques such as Step-Decaying, Cyclical-Learning, and Warmup to…

Machine Learning · Computer Science 2020-06-12 Mahdi S. Hosseini , Konstantinos N. Plataniotis

Distributed Optimization and Learning for Automated Stepsize Selection with Finite Time Coordination

Distributed optimization and learning algorithms are designed to operate over large scale networks enabling processing of vast amounts of data effectively and efficiently. One of the main challenges for ensuring a smooth learning process in…

Systems and Control · Electrical Eng. & Systems 2026-01-21 Apostolos I. Rikos , Nicola Bastianello , Themistoklis Charalambous , Karl H. Johansson

Step Size Matters in Deep Learning

Training a neural network with the gradient descent algorithm gives rise to a discrete-time nonlinear dynamical system. Consequently, behaviors that are typically observed in these systems emerge during training, such as convergence to an…

Machine Learning · Computer Science 2018-10-10 Kamil Nar , S. Shankar Sastry

Differentiable Self-Adaptive Learning Rate

Learning rate adaptation is a popular topic in machine learning. Gradient Descent trains neural nerwork with a fixed learning rate. Learning rate adaptation is proposed to accelerate the training process through adjusting the step size in…

Machine Learning · Computer Science 2022-10-20 Bozhou Chen , Hongzhi Wang , Chenmin Ba

Stochastic Learning under Random Reshuffling with Constant Step-sizes

In empirical risk optimization, it has been observed that stochastic gradient implementations that rely on random reshuffling of the data achieve better performance than implementations that rely on sampling the data uniformly. Recent works…

Machine Learning · Computer Science 2019-01-30 Bicheng Ying , Kun Yuan , Stefan Vlaski , Ali H. Sayed