Related papers: CompAdaGrad: A Compressed, Complementary, Computat…

Scalable Adaptive Stochastic Optimization Using Random Projections

Adaptive stochastic gradient methods such as AdaGrad have gained popularity in particular for training deep neural networks. The most commonly used and studied variant maintains a diagonal matrix approximation to second order information by…

Machine Learning · Statistics 2016-11-22 Gabriel Krummenacher , Brian McWilliams , Yannic Kilcher , Joachim M. Buhmann , Nicolai Meinshausen

A Full Adagrad algorithm with O(Nd) operations

A novel approach is given to overcome the computational challenges of the full-matrix Adaptive Gradient algorithm (Full AdaGrad) in stochastic optimization. By developing a recursive method that estimates the inverse of the square root of…

Statistics Theory · Mathematics 2025-02-28 Antoine Godichon-Baggioni , Wei Lu , Bruno Portier

Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models

Adaptive gradient methods like Adagrad and its variants are widespread in large-scale optimization. However, their use of diagonal preconditioning matrices limits the ability to capture parameter correlations. Full-matrix adaptive methods,…

Machine Learning · Computer Science 2025-09-01 Tatyana Matveeva , Aleksandr Katrutsa , Evgeny Frolov

Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and…

Machine Learning · Computer Science 2021-08-27 Aaron Defazio , Samy Jelassi

AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization…

Machine Learning · Statistics 2021-04-20 Rachel Ward , Xiaoxia Wu , Leon Bottou

MetaGrad: Adaptation using Multiple Learning Rates in Online Learning

We provide a new adaptive method for online convex optimization, MetaGrad, that is robust to general convex losses but achieves faster rates for a broad class of special functions, including exp-concave and strongly convex functions, but…

Machine Learning · Computer Science 2021-08-31 Tim van Erven , Wouter M. Koolen , Dirk van der Hoeven

Adaptive Gradient Methods for Constrained Convex Optimization and Variational Inequalities

We provide new adaptive first-order methods for constrained convex optimization. Our main algorithms AdaACSA and AdaAGD+ are accelerated methods, which are universal in the sense that they achieve nearly-optimal convergence rates for both…

Machine Learning · Computer Science 2021-02-17 Alina Ene , Huy L. Nguyen , Adrian Vladu

AdaGrad-Diff: A New Version of the Adaptive Gradient Algorithm

Vanilla gradient methods are often highly sensitive to the choice of stepsize, which typically requires manual tuning. Adaptive methods alleviate this issue and have therefore become widely used. Among them, AdaGrad has been particularly…

Machine Learning · Statistics 2026-02-16 Matia Bojovic , Saverio Salzo , Massimiliano Pontil

MetaGrad: Multiple Learning Rates in Online Learning

In online convex optimization it is well known that certain subclasses of objective functions are much easier than arbitrary convex functions. We are interested in designing adaptive methods that can automatically get fast rates in as many…

Machine Learning · Computer Science 2021-08-31 Tim van Erven , Wouter M. Koolen

Grad-GradaGrad? A Non-Monotone Adaptive Stochastic Gradient Method

The classical AdaGrad method adapts the learning rate by dividing by the square root of a sum of squared gradients. Because this sum on the denominator is increasing, the method can only decrease step sizes over time, and requires a…

Machine Learning · Computer Science 2022-06-15 Aaron Defazio , Baoyu Zhou , Lin Xiao

AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

Highly distributed training of Deep Neural Networks (DNNs) on future compute platforms (offering hundreds of TeraOps/s of computational capacity) is expected to be severely communication constrained. To overcome this limitation, new gradient…

Machine Learning · Computer Science 2017-12-08 Chia-Yu Chen , Jungwook Choi , Daniel Brand , Ankur Agrawal , Wei Zhang , Kailash Gopalakrishnan

Global Convergence of Adaptive Gradient Methods for An Over-parameterized Neural Network

Adaptive gradient methods like AdaGrad are widely used in optimizing neural networks. Yet, existing convergence guarantees for adaptive gradient methods require either convexity or smoothness, and, in the smooth setting, only guarantee…

Machine Learning · Computer Science 2019-10-22 Xiaoxia Wu , Simon S. Du , Rachel Ward

Stability and convergence analysis of AdaGrad for non-convex optimization via novel stopping time-based techniques

Adaptive gradient optimizers (AdaGrad), which dynamically adjust the learning rate based on iterative gradients, have emerged as powerful tools in deep learning. These adaptive methods have significantly succeeded in various deep learning…

Optimization and Control · Mathematics 2024-12-31 Ruinan Jin , Xiaoyu Wang , Baoxiang Wang

Universality of AdaGrad Stepsizes for Stochastic Optimization: Inexact Oracle, Acceleration and Variance Reduction

We present adaptive gradient methods (both basic and accelerated) for solving convex composite optimization problems in which the main part is approximately smooth (a.k.a. $(\delta, L)$-smooth) and can be accessed only via a (potentially…

Optimization and Control · Mathematics 2024-06-11 Anton Rodomanov , Xiaowen Jiang , Sebastian Stich

Dynamic Regret of Adaptive Gradient Methods for Strongly Convex Problems

Adaptive gradient algorithms such as ADAGRAD and its variants have gained popularity in the training of deep neural networks. While many works on adaptive methods have focused on the static regret as a performance metric to achieve a…

Machine Learning · Computer Science 2022-09-07 Parvin Nazari , Esmaile Khorram

AdaGrad under Anisotropic Smoothness

Adaptive gradient methods have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite the huge success in practice, their theoretical advantages over classical gradient methods with…

Machine Learning · Computer Science 2024-10-15 Yuxing Liu , Rui Pan , Tong Zhang

diffGrad: An Optimization Method for Convolutional Neural Networks

Stochastic Gradient Descent (SGD) is one of the core techniques behind the success of deep neural networks. The gradient provides information on the direction in which a function has the steepest rate of change. The main problem with basic…

Machine Learning · Computer Science 2021-11-30 Shiv Ram Dubey , Soumendu Chakraborty , Swalpa Kumar Roy , Snehasis Mukherjee , Satish Kumar Singh , Bidyut Baran Chaudhuri

Adaptive Online Learning for Gradient-Based Optimizers

As application demands for online convex optimization accelerate, the need to design new methods that simultaneously cover a large class of convex functions and incur the lowest possible regret is rising rapidly. Known online…

Machine Learning · Computer Science 2019-06-04 Saeed Masoudian , Ali Arabzadeh , Mahdi Jafari Siavoshani , Milad Jalal , Alireza Amouzad

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We…

Machine Learning · Statistics 2018-05-23 Ashia C. Wilson , Rebecca Roelofs , Mitchell Stern , Nathan Srebro , Benjamin Recht

On the Convergence of AdaGrad(Norm) on $\mathbb{R}^{d}$: Beyond Convexity, Non-Asymptotic Rate and Acceleration

Existing analysis of AdaGrad and other adaptive methods for smooth convex optimization is typically for functions with bounded domain diameter. In unconstrained problems, previous works guarantee an asymptotic convergence rate without an…

Machine Learning · Computer Science 2023-10-05 Zijian Liu , Ta Duy Nguyen , Alina Ene , Huy L. Nguyen