Related papers: Scalable Second Order Optimization for Deep Learni…

Second-Order Stochastic Optimization for Machine Learning in Linear Time

First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored…

Machine Learning · Statistics 2017-12-01 Naman Agarwal, Brian Bullins, Elad Hazan

Second-Order Neural ODE Optimizer

We propose a novel second-order optimization framework for training the emerging deep continuous-time models, specifically the Neural Ordinary Differential Equations (Neural ODEs). Since their training already involves expensive gradient…

Machine Learning · Computer Science 2021-11-09 Guan-Horng Liu, Tianrong Chen, Evangelos A. Theodorou

Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam,…

Machine Learning · Computer Science 2026-04-15 Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, Yong Liu, Shuicheng Yan

Second-order Information in First-order Optimization Methods

In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients,…

Machine Learning · Computer Science 2019-12-23 Yuzheng Hu, Licong Lin, Shange Tang

Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis

First-order optimization methods remain the standard for training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix. Despite the…

Machine Learning · Computer Science 2025-04-30 Damien Martins Gomes

Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order Information

We present a novel adaptive optimization algorithm for large-scale machine learning problems. Equipped with a low-cost estimate of local curvature and Lipschitz smoothness, our method dynamically adapts the search direction and step-size.…

Machine Learning · Computer Science 2021-09-14 Majid Jahani, Sergey Rusakov, Zheng Shi, Peter Richtárik, Michael W. Mahoney, Martin Takáč

First Demonstration of Second-order Training of Deep Neural Networks with In-memory Analog Matrix Computing

Second-order optimization methods, which leverage curvature information, offer faster and more stable convergence than first-order methods such as stochastic gradient descent (SGD) and Adam. However, their practical adoption is hindered by…

Emerging Technologies · Computer Science 2025-12-08 Saitao Zhang, Yubiao Luo, Shiqing Wang, Pushen Zuo, Yongxiang Li, Lunshuai Pan, Zheng Miao, Zhong Sun

Memory-Efficient Adaptive Optimization

Adaptive gradient-based optimizers such as Adagrad and Adam are crucial for achieving state-of-the-art performance in machine translation and language modeling. However, these methods maintain second-order statistics for each parameter,…

Machine Learning · Computer Science 2019-09-13 Rohan Anil, Vineet Gupta, Tomer Koren, Yoram Singer

Optimization Methods in Deep Learning: A Comprehensive Overview

In recent years, deep learning has achieved remarkable success in various fields such as image recognition, natural language processing, and speech recognition. The effectiveness of deep learning largely depends on the optimization methods…

Machine Learning · Computer Science 2023-04-25 David Shulman

First-Order Preconditioning via Hypergradient Descent

Standard gradient descent methods are susceptible to a range of issues that can impede training, such as high correlations and different scaling in parameter space. These difficulties can be addressed by second-order approaches that apply a…

Machine Learning · Computer Science 2020-04-29 Ted Moskovitz, Rui Wang, Janice Lan, Sanyam Kapoor, Thomas Miconi, Jason Yosinski, Aditya Rawal

AdaFisher: Adaptive Second Order Optimization via Fisher Information

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic…

Machine Learning · Computer Science 2025-03-12 Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini

Adaptive First- and Second-Order Algorithms for Large-Scale Machine Learning

In this paper, we consider both first- and second-order techniques to address continuous optimization problems arising in machine learning. In the first-order case, we propose a framework of transition from deterministic or…

Machine Learning · Computer Science 2021-11-30 Sanae Lotfi, Tiphaine Bonniot de Ruisselet, Dominique Orban, Andrea Lodi

Towards Guided Descent: Optimization Algorithms for Training Neural Networks At Scale

Neural network optimization remains one of the most consequential yet poorly understood challenges in modern AI research, where improvements in training algorithms can lead to enhanced feature learning in foundation models,…

Machine Learning · Computer Science 2025-12-23 Ansh Nagwekar

Towards Differentiable Multilevel Optimization: A Gradient-Based Approach

Multilevel optimization has gained renewed interest in machine learning due to its promise in applications such as hyperparameter tuning and continual learning. However, existing methods struggle with the inherent difficulty of efficiently…

Machine Learning · Computer Science 2024-10-16 Yuntian Gu, Xuzheng Chen

Second-Order Guarantees in Centralized, Federated and Decentralized Nonconvex Optimization

Rapid advances in data collection and processing capabilities have allowed for the use of increasingly complex models that give rise to nonconvex optimization problems. These formulations, however, can be arbitrarily difficult to solve in…

Multiagent Systems · Computer Science 2020-04-01 Stefan Vlaski, Ali H. Sayed

Fast Stochastic Second-Order Adagrad for Nonconvex Bound-Constrained Optimization

ADAGB2, a generalization of the Adagrad algorithm for stochastic optimization, is introduced, which is also applicable to bound-constrained problems and capable of using second-order information when available. It is shown that, given…

Optimization and Control · Mathematics 2025-05-13 S. Bellavia, S. Gratton, B. Morini, Ph. L. Toint

Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from…

Machine Learning · Computer Science 2025-02-12 Abulikemu Abuduweili, Changliu Liu

On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width

Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific…

Machine Learning · Computer Science 2024-06-11 Satoki Ishikawa, Ryo Karakida

Adaptive scaling of the learning rate by second order automatic differentiation

In the context of the optimization of Deep Neural Networks, we propose to rescale the learning rate using a new technique of automatic differentiation. This technique relies on the computation of the curvature, a second order…

Neural and Evolutionary Computing · Computer Science 2022-10-27 Frédéric de Gournay, Alban Gossard

Scaled stochastic gradient descent for low-rank matrix completion

The paper looks at a scaled variant of the stochastic gradient descent algorithm for the matrix completion problem. Specifically, we propose a novel matrix-scaling of the partial derivatives that acts as an efficient preconditioning for the…

Machine Learning · Computer Science 2016-10-06 Bamdev Mishra, Rodolphe Sepulchre