Related papers: A Trainable Optimizer

Learning to Optimize Quasi-Newton Methods

Fast gradient-based optimization algorithms have become increasingly essential for the computationally efficient training of machine learning models. One technique is to multiply the gradient by a preconditioner matrix to produce a step,…

Machine Learning · Computer Science 2023-09-12 Isaac Liao, Rumen R. Dangovski, Jakob N. Foerster, Marin Soljačić

A Triple-Inertial Accelerated Alternating Optimization Method for Deep Learning Training

The stochastic gradient descent (SGD) algorithm has achieved remarkable success in training deep learning models. However, it has several limitations, including susceptibility to vanishing gradients, sensitivity to input data, and a lack of…

Machine Learning · Computer Science 2025-03-14 Chengcheng Yan, Jiawei Xu, Qingsong Wang, Zheng Peng

Masked Training of Neural Networks with Partial Gradients

State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD). Recently, many variations have been explored: perturbing parameters for better accuracy (such as in Extragradient), limiting SGD…

Machine Learning · Computer Science 2022-03-23 Amirkeivan Mohtashami, Martin Jaggi, Sebastian U. Stich

Learning Gradient Descent: Better Generalization and Longer Horizons

Training deep neural networks is a highly nontrivial task, involving carefully selecting appropriate training algorithms, scheduling step sizes and tuning other hyperparameters. Trying different combinations can be quite labor-intensive and…

Machine Learning · Computer Science 2017-06-13 Kaifeng Lv, Shunhua Jiang, Jian Li

Sublinear Optimization for Machine Learning

We give sublinear-time approximation algorithms for some optimization problems arising in machine learning, such as training linear classifiers and finding minimum enclosing balls. Our algorithms can be extended to some kernelized versions…

Machine Learning · Computer Science 2010-10-22 Kenneth L. Clarkson, Elad Hazan, David P. Woodruff

An Adaptive Gradient Method with Energy and Momentum

We introduce a novel algorithm for gradient-based optimization of stochastic objective functions. The method may be seen as a variant of SGD with momentum equipped with an adaptive learning rate automatically adjusted by an 'energy'…

Optimization and Control · Mathematics 2022-03-24 Hailiang Liu, Xuping Tian

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD)…

Machine Learning · Computer Science 2020-06-24 Jinghui Chen, Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, Quanquan Gu

Narrowing the Focus: Learned Optimizers for Pretrained Models

In modern deep learning, the models are learned by applying gradient updates using an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed and tuning their hyperparameters is a big part of…

Machine Learning · Computer Science 2024-10-08 Gus Kristiansen, Mark Sandler, Andrey Zhmoginov, Nolan Miller, Anirudh Goyal, Jihwan Lee, Max Vladymyrov

Tom: Leveraging trend of the observed gradients for faster convergence

The success of deep learning can be attributed to various factors such as increase in computational power, large datasets, deep convolutional neural networks, optimizers etc. Particularly, the choice of optimizer affects the generalization,…

Machine Learning · Computer Science 2021-09-10 Anirudh Maiya, Inumella Sricharan, Anshuman Pandey, Srinivas K. S

Learning to optimize with convergence guarantees using nonlinear system theory

The increasing reliance on numerical methods for controlling dynamical systems and training machine learning models underscores the need to devise algorithms that dependably and efficiently navigate complex optimization landscapes.…

Systems and Control · Electrical Eng. & Systems 2024-06-04 Andrea Martin, Luca Furieri

Lookahead Optimizer: k steps forward, 1 step back

The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate…

Machine Learning · Computer Science 2019-12-04 Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba

Adaptive Optimization for Enhanced Efficiency in Large-Scale Language Model Training

With the rapid development of natural language processing technology, large-scale language models (LLM) have achieved remarkable results in a variety of tasks. However, how to effectively train these huge models and improve their…

Artificial Intelligence · Computer Science 2024-12-09 Jiajing Chen, Bingying Liu, Xiaoxuan Liao, Jia Gao, Hongye Zheng, Yue Li

Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes

The training of deep neural networks is inherently a nonconvex optimization problem, yet standard approaches such as stochastic gradient descent (SGD) require simultaneous updates to all parameters, often leading to unstable convergence and…

Machine Learning · Computer Science 2025-08-07 Chengcheng Yan, Jiawei Xu, Zheng Peng, Qingsong Wang

Greedy Learning to Optimize with Convergence Guarantees

Learning to optimize is an approach that leverages training data to accelerate the solution of optimization problems. Many approaches use unrolling to parametrize the update step and learn optimal parameters. Although L2O has shown…

Optimization and Control · Mathematics 2025-07-15 Patrick Fahy, Mohammad Golbabaee, Matthias J. Ehrhardt

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We…

Machine Learning · Statistics 2018-05-23 Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, Benjamin Recht

Random Scaling and Momentum for Non-smooth Non-convex Optimization

Training neural networks requires optimizing a loss function that may be highly irregular, and in particular neither convex nor smooth. Popular training algorithms are based on stochastic gradient descent with momentum (SGDM), for which…

Machine Learning · Computer Science 2026-03-17 Qinzi Zhang, Ashok Cutkosky

A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm

A crucial component of machine learning algorithms is minimizing loss functions with less computational cost and less oscillations. While adaptive learning rate-based optimizers have been widely used for real-world tasks, they do not…

Machine Learning · Computer Science 2026-05-29 Sakshi Kumari, Shyam Kumar M, Sushmitha P

Learned Optimizers that Scale and Generalize

Learning to learn has emerged as an important direction for achieving artificial intelligence. Two of the primary barriers to its adoption are an inability to scale to larger problems and a limited ability to generalize to new tasks. We…

Machine Learning · Computer Science 2017-09-11 Olga Wichrowska, Niru Maheswaranathan, Matthew W. Hoffman, Sergio Gomez Colmenarejo, Misha Denil, Nando de Freitas, Jascha Sohl-Dickstein

Globally Optimal Training of Generalized Polynomial Neural Networks with Nonlinear Spectral Methods

The optimization problem behind neural networks is highly non-convex. Training with stochastic gradient descent and variants requires careful parameter tuning and provides no guarantee to achieve the global optimum. In contrast we show…

Machine Learning · Computer Science 2016-10-31 Antoine Gautier, Quynh Nguyen, Matthias Hein

Appropriate Learning Rates of Adaptive Learning Rate Optimization Algorithms for Training Deep Neural Networks

This paper deals with nonconvex stochastic optimization problems in deep learning and provides appropriate learning rates with which adaptive learning rate optimization algorithms, such as Adam and AMSGrad, can approximate a stationary…

Optimization and Control · Mathematics 2020-11-24 Hideaki Iiduka