Related papers: Faster Adaptive Optimization via Expected Gradient…

How do simple rotations affect the implicit bias of Adam?

Adaptive gradient methods such as Adam and Adagrad are widely used in machine learning, yet their effect on the generalization of learned models -- relative to methods like gradient descent -- remains poorly understood. Prior work on binary…

Machine Learning · Computer Science 2025-10-29 Adela DePavia, Vasileios Charisopoulos, Rebecca Willett

Improving Adaptive Moment Optimization via Preconditioner Diagonalization

Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based…

Machine Learning · Computer Science 2025-02-12 Son Nguyen, Bo Liu, Lizhang Chen, Qiang Liu

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers explained it by investigating the implicit regularization effect…

Machine Learning · Computer Science 2021-12-17 Bohan Wang, Qi Meng, Wei Chen, Tie-Yan Liu

Memory-Efficient Adaptive Optimization

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter,…

Machine Learning · Computer Science 2019-09-13 Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Reparametrizing gradient descent

In this work, we propose an optimization algorithm which we call norm-adapted gradient descent. This algorithm is similar to other gradient-based optimization algorithms like Adam or Adagrad in that it adapts the learning rate of stochastic…

Machine Learning · Computer Science 2020-10-14 David Sprunger

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee…

Machine Learning · Computer Science 2019-10-22 Xiaoxia Wu, Simon S. Du, Rachel Ward

Expectigrad: Fast Stochastic Optimization with Robust Convergence Properties

Many popular adaptive gradient methods such as Adam and RMSProp rely on an exponential moving average (EMA) to normalize their stepsizes. While the EMA makes these methods highly responsive to new gradient information, recent research has…

Machine Learning · Computer Science 2021-10-13 Brett Daley, Christopher Amato

Fast and Correct Gradient-Based Optimisation for Probabilistic Programming via Smoothing

We study the foundations of variational inference, which frames posterior inference as an optimisation problem, for probabilistic programming. The dominant approach for optimisation in practice is stochastic gradient descent. In particular,…

Programming Languages · Computer Science 2023-01-10 Basim Khajwal, C.-H. Luke Ong, Dominik Wagner

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We…

Machine Learning · Statistics 2018-05-23 Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht

Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma, Jimmy Ba

AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates

The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for…

Machine Learning · Computer Science 2025-09-09 Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models

Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods,…

Machine Learning · Computer Science 2025-09-01 Tatyana Matveeva, Aleksandr Katrutsa, Evgeny Frolov

Adaptive Gradient Methods Can Be Provably Faster than SGD after Finite Epochs

Adaptive gradient methods have attracted much attention from the machine learning community due to their high efficiency. However, their acceleration effect in practice, especially in neural network training, is hard to analyze theoretically. The…

Optimization and Control · Mathematics 2020-06-15 Xunpeng Huang, Hao Zhou, Runxin Xu, Zhe Wang, Lei Li

Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search)

Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data.…

Machine Learning · Computer Science 2021-02-22 Sharan Vaswani, Issam Laradji, Frederik Kunstner, Si Yi Meng, Mark Schmidt, Simon Lacoste-Julien

Towards Better Understanding of Adaptive Gradient Algorithms in Generative Adversarial Nets

Adaptive gradient algorithms perform gradient-based updates using the history of gradients and are ubiquitous in training deep neural networks. While adaptive gradient methods theory is well understood for minimization problems, the…

Optimization and Control · Mathematics 2020-12-29 Mingrui Liu, Youssef Mroueh, Jerret Ross, Wei Zhang, Xiaodong Cui, Payel Das, Tianbao Yang

Scalable Adaptive Stochastic Optimization Using Random Projections

Adaptive stochastic gradient methods such as AdaGrad have gained popularity in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second order information by…

Machine Learning · Statistics 2016-11-22 Gabriel Krummenacher, Brian McWilliams, Yannic Kilcher, Joachim M. Buhmann, Nicolai Meinshausen

AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

Most popular optimizers for deep learning can be broadly categorized as adaptive methods (e.g. Adam) and accelerated schemes (e.g. stochastic gradient descent (SGD) with momentum). For many models such as convolutional neural networks…

Machine Learning · Computer Science 2020-12-22 Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar Tatikonda, Nicha Dvornek, Xenophon Papademetris, James S. Duncan

Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Stochastic optimization plays a crucial role in the advancement of deep learning technologies. Over the decades, significant effort has been dedicated to improving the training efficiency and robustness of deep neural networks, via various…

Machine Learning · Computer Science 2024-08-21 Huixiu Jiang, Ling Yang, Yu Bao, Rutong Si, Sikun Yang

Model-Based Reparameterization Policy Gradient Methods: Theory and Practical Algorithms

ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely adopted for continuous control tasks in robotics and computer graphics. However, recent studies have revealed that, when applied to long-term reinforcement learning…

Machine Learning · Computer Science 2023-11-01 Shenao Zhang, Boyi Liu, Zhaoran Wang, Tuo Zhao

Exploring the Optimized Value of Each Hyperparameter in Various Gradient Descent Algorithms

In recent years, various gradient descent algorithms, including gradient descent, gradient descent with momentum, adaptive gradient (AdaGrad), root-mean-square propagation (RMSProp), and adaptive moment estimation (Adam)…

Machine Learning · Computer Science 2024-09-19 Abel C. H. Chen