Related papers: Eva: A General Vectorized Approximation Framework …

Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization

Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful…

Machine Learning · Computer Science 2022-06-13 Frederik Benzing

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Large-scale distributed training of deep neural networks suffer from the generalization gap caused by the increase in the effective mini-batch size. Previous approaches try to solve this problem by varying the learning rate and batch size…

Machine Learning · Computer Science 2019-04-02 Kazuki Osawa , Yohei Tsuji , Yuichiro Ueno , Akira Naruse , Rio Yokota , Satoshi Matsuoka

SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation

Approximate second-order optimization methods often exhibit poorer generalization compared to first-order approaches. In this work, we look into this issue through the lens of the loss landscape and find that existing second-order methods…

Machine Learning · Computer Science 2025-06-25 Dahun Shin , Dongyeop Lee , Jinseok Chung , Namhoon Lee

Efficient Second-Order Neural Network Optimization via Adaptive Trust Region Methods

Second-order optimization methods offer notable advantages in training deep neural networks by utilizing curvature information to achieve faster convergence. However, traditional second-order techniques are computationally prohibitive,…

Machine Learning · Computer Science 2024-10-04 James Vo

First Demonstration of Second-order Training of Deep Neural Networks with In-memory Analog Matrix Computing

Second-order optimization methods, which leverage curvature information, offer faster and more stable convergence than first-order methods such as stochastic gradient descent (SGD) and Adam. However, their practical adoption is hindered by…

Emerging Technologies · Computer Science 2025-12-08 Saitao Zhang , Yubiao Luo , Shiqing Wang , Pushen Zuo , Yongxiang Li , Lunshuai Pan , Zheng Miao , Zhong Sun

KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization

Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a…

Machine Learning · Statistics 2023-06-01 Jonathan Mei , Alexander Moreno , Luke Walters

Stochastic Second-Order Optimization via von Neumann Series

A stochastic iterative algorithm approximating second-order information using von Neumann series is discussed. We present convergence guarantees for strongly-convex and smooth functions. Our analysis is much simpler in contrast to a similar…

Optimization and Control · Mathematics 2017-04-14 Mojmir Mutny

KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks

Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to…

Machine Learning · Computer Science 2021-09-21 J. Gregory Pauloski , Qi Huang , Lei Huang , Shivaram Venkataraman , Kyle Chard , Ian Foster , Zhao Zhang

On the Parameterization of Second-Order Optimization Effective Towards the Infinite Width

Second-order optimization has been developed to accelerate the training of deep neural networks and it is being applied to increasingly larger-scale models. In this study, towards training on further larger scales, we identify a specific…

Machine Learning · Computer Science 2024-06-11 Satoki Ishikawa , Ryo Karakida

2nd-order Updates with 1st-order Complexity

It has long been a goal to efficiently compute and use second order information on a function ($f$) to assist in numerical approximations. Here it is shown how, using only basic physics and a numerical approximation, such information can be…

Machine Learning · Computer Science 2021-05-31 Michael F. Zimmer

Accelerating Stochastic Probabilistic Inference

Recently, Stochastic Variational Inference (SVI) has been increasingly attractive thanks to its ability to find good posterior approximations of probabilistic models. It optimizes the variational objective with stochastic optimization,…

Machine Learning · Computer Science 2022-03-16 Minta Liu , Suliang Bu

Structured and Fast Optimization: The Kronecker SGD Algorithm

Stochastic gradient descent (SGD) now acts as a fundamental part of optimization in current machine learning. Meanwhile, deep learning architectures have shown outstanding performance in a wide range of fields, such as natural language…

Machine Learning · Computer Science 2026-01-27 Zhao Song , Song Yue

Minibatching Offers Improved Generalization Performance for Second Order Optimizers

Training deep neural networks (DNNs) used in modern machine learning is computationally expensive. Machine learning scientists, therefore, rely on stochastic first-order methods for training, coupled with significant hand-tuning, to obtain…

Machine Learning · Computer Science 2023-07-24 Eric Silk , Swarnita Chakraborty , Nairanjana Dasgupta , Anand D. Sarwate , Andrew Lumsdaine , Tony Chiang

Second-Order Stochastic Optimization for Machine Learning in Linear Time

First-order stochastic methods are the state-of-the-art in large-scale machine learning optimization owing to efficient per-iteration complexity. Second-order methods, while able to provide faster convergence, have been much less explored…

Machine Learning · Statistics 2017-12-01 Naman Agarwal , Brian Bullins , Elad Hazan

Stochastic Training of Neural Networks via Successive Convex Approximations

This paper proposes a new family of algorithms for training neural networks (NNs). These are based on recent developments in the field of non-convex optimization, going under the general name of successive convex approximation (SCA)…

Machine Learning · Statistics 2017-06-16 Simone Scardapane , Paolo Di Lorenzo

Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction on the time and cost of training. Adam and its variants have been state-of-the-art for years,…

Machine Learning · Computer Science 2024-03-06 Hong Liu , Zhiyuan Li , David Hall , Percy Liang , Tengyu Ma

Second-Order Neural ODE Optimizer

We propose a novel second-order optimization framework for training the emerging deep continuous-time models, specifically the Neural Ordinary Differential Equations (Neural ODEs). Since their training already involves expensive gradient…

Machine Learning · Computer Science 2021-11-09 Guan-Horng Liu , Tianrong Chen , Evangelos A. Theodorou

A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its Benefits

This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a…

Statistical Finance · Quantitative Finance 2024-11-25 Tsogt-Ochir Enkhbayar

Low-Order Explicit Hessian Imitation Method for Large-Scale Supervised Machine Learning

An algorithm is proposed for solving optimization problems arising in neural network training for supervised learning. The unique feature of the algorithm is the use of an auxiliary loss, in addition to the original loss employed for model…

Optimization and Control · Mathematics 2026-05-11 Yunlang Zhu , Lingjun Guo , Zahra Khatti , Xiaoyi Qu , Chia-Yuan Wu , Lara Zebiane , Frank E. Curtis

Evolution of Optimization Methods: Algorithms, Scenarios, and Evaluations

Balancing convergence speed, generalization capability, and computational efficiency remains a core challenge in deep learning optimization. First-order gradient descent methods, epitomized by stochastic gradient descent (SGD) and Adam,…

Machine Learning · Computer Science 2026-04-15 Tong Zhang , Jiangning Zhang , Zhucun Xue , Juntao Jiang , Yicheng Xu , Chengming Xu , Teng Hu , Xingyu Xie , Xiaobin Hu , Yabiao Wang , Yong Liu , Shuicheng Yan