Related papers: Two-Level K-FAC Preconditioning for Deep Learning

Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis

Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the…

Machine Learning · Computer Science 2021-07-27 Thomas George , César Laurent , Xavier Bouthillier , Nicolas Ballas , Pascal Vincent

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's…

Machine Learning · Computer Science 2020-06-09 James Martens , Roger Grosse

MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature

Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. However, it comes at the expense of high computational burden. In this work, we…

Machine Learning · Computer Science 2025-11-12 Hyunseok Seung , Jaewoo Lee , Hyunsuk Ko

A Trace-restricted Kronecker-Factored Approximation to Natural Gradient

Second-order optimization methods have the ability to accelerate convergence by modifying the gradient through the curvature matrix. There have been many attempts to use second-order optimization methods for training deep neural networks.…

Machine Learning · Computer Science 2020-11-24 Kai-Xin Gao , Xiao-Lei Liu , Zheng-Hai Huang , Min Wang , Zidong Wang , Dachuan Xu , Fan Yu

Analysis and Comparison of Two-Level KFAC Methods for Training Deep Neural Networks

As a second-order method, the Natural Gradient Descent (NGD) has the ability to accelerate training of neural networks. However, due to the prohibitive computational and memory costs of computing and inverting the Fisher Information Matrix…

Machine Learning · Computer Science 2023-04-04 Abdoulaye Koroko , Ani Anciaux-Sedrakian , Ibtihel Ben Gharbia , Valérie Garès , Mounir Haddou , Quang Huy Tran

An iterative K-FAC algorithm for Deep Learning

Kronecker-factored Approximate Curvature (K-FAC) method is a high efficiency second order optimizer for the deep learning. Its training time is less than SGD(or other first-order method) with same accuracy in many large-scale problems. The…

Machine Learning · Computer Science 2021-01-05 Yingshi Chen

Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition

Several studies have shown the ability of natural gradient descent to minimize the objective function more efficiently than ordinary gradient descent based methods. However, the bottleneck of this approach for training deep neural networks…

Neural and Evolutionary Computing · Computer Science 2022-10-17 Abdoulaye Koroko , Ani Anciaux-Sedrakian , Ibtihel Ben Gharbia , Valérie Garès , Mounir Haddou , Quang Huy Tran

Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization

Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful…

Machine Learning · Computer Science 2022-06-13 Frederik Benzing

Convolutional Neural Network Training with Distributed K-FAC

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an…

Machine Learning · Computer Science 2020-07-03 J. Gregory Pauloski , Zhao Zhang , Lei Huang , Weijia Xu , Ian T. Foster

A Mini-Block Fisher Method for Deep Neural Networks

Deep neural networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal…

Machine Learning · Computer Science 2022-10-28 Achraf Bahamou , Donald Goldfarb , Yi Ren

Eigenvalue-corrected Natural Gradient Based on a New Approximation

Using second-order optimization methods for training deep neural networks (DNNs) has attracted many researchers. A recently proposed method, Eigenvalue-corrected Kronecker Factorization (EKFAC) (George et al., 2018), proposes an…

Machine Learning · Computer Science 2020-11-30 Kai-Xin Gao , Xiao-Lei Liu , Zheng-Hai Huang , Min Wang , Shuangling Wang , Zidong Wang , Dachuan Xu , Fan Yu

A Kronecker-factored approximate Fisher matrix for convolution layers

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to…

Machine Learning · Statistics 2016-05-25 Roger Grosse , James Martens

AdaFisher: Adaptive Second Order Optimization via Fisher Information

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic…

Machine Learning · Computer Science 2025-03-12 Damien Martins Gomes , Yanlei Zhang , Eugene Belilovsky , Guy Wolf , Mahdi S. Hosseini

Improving Adaptive Moment Optimization via Preconditioner Diagonalization

Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based…

Machine Learning · Computer Science 2025-02-12 Son Nguyen , Bo Liu , Lizhang Chen , Qiang Liu

Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis

First-order optimization methods remain the standard for training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by preconditioning the stochastic gradient with a diagonal matrix. Despite the…

Machine Learning · Computer Science 2025-04-30 Damien Martins Gomes

FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information

This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We provide an accessible and detailed analysis of the diagonal…

Machine Learning · Computer Science 2024-09-05 Dongseong Hwang

Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra

K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very…

Machine Learning · Computer Science 2022-11-28 Constantin Octavian Puiu

A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its Benefits

This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a…

Statistical Finance · Quantitative Finance 2024-11-25 Tsogt-Ochir Enkhbayar

Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates

K-FAC (arXiv:1503.05671, arXiv:1602.01407) is a tractable implementation of Natural Gradient (NG) for Deep Learning (DL), whose bottleneck is computing the inverses of the so-called ``Kronecker-Factors'' (K-factors). RS-KFAC…

Machine Learning · Computer Science 2023-09-13 Constantin Octavian Puiu

Natural Gradient Methods: Perspectives, Efficient-Scalable Approximations, and Analysis

Natural Gradient Descent, a second-degree optimization method motivated by the information geometry, makes use of the Fisher Information Matrix instead of the Hessian which is typically used. However, in many cases, the Fisher Information…

Machine Learning · Computer Science 2023-03-10 Rajesh Shrestha