Related papers: An iterative K-FAC algorithm for Deep Learning

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). K-FAC is based on an efficiently invertible approximation of a neural network's…

Machine Learning · Computer Science 2020-06-09 James Martens, Roger Grosse

A Trace-restricted Kronecker-Factored Approximation to Natural Gradient

Second-order optimization methods have the ability to accelerate convergence by modifying the gradient through the curvature matrix. There have been many attempts to use second-order optimization methods for training deep neural networks.…

Machine Learning · Computer Science 2020-11-24 Kai-Xin Gao, Xiao-Lei Liu, Zheng-Hai Huang, Min Wang, Zidong Wang, Dachuan Xu, Fan Yu

Convolutional Neural Network Training with Distributed K-FAC

Training neural networks with many processors can reduce time-to-solution; however, it is challenging to maintain convergence and efficiency at large scales. The Kronecker-factored Approximate Curvature (K-FAC) was recently proposed as an…

Machine Learning · Computer Science 2020-07-03 J. Gregory Pauloski, Zhao Zhang, Lei Huang, Weijia Xu, Ian T. Foster

Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition

Several studies have shown the ability of natural gradient descent to minimize the objective function more efficiently than ordinary gradient descent based methods. However, the bottleneck of this approach for training deep neural networks…

Neural and Evolutionary Computing · Computer Science 2022-10-17 Abdoulaye Koroko, Ani Anciaux-Sedrakian, Ibtihel Ben Gharbia, Valérie Garès, Mounir Haddou, Quang Huy Tran

MAC: An Efficient Gradient Preconditioning using Mean Activation Approximated Curvature

Second-order optimization methods for training neural networks, such as KFAC, exhibit superior convergence by utilizing curvature information of loss landscape. However, it comes at the expense of high computational burden. In this work, we…

Machine Learning · Computer Science 2025-11-12 Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko

A New Way: Kronecker-Factored Approximate Curvature Deep Hedging and its Benefits

This paper advances the computational efficiency of Deep Hedging frameworks through the novel integration of Kronecker-Factored Approximate Curvature (K-FAC) optimization. While recent literature has established Deep Hedging as a…

Statistical Finance · Quantitative Finance 2024-11-25 Tsogt-Ochir Enkhbayar

A Kronecker-factored approximate Fisher matrix for convolution layers

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the exact natural gradient is impractical to…

Machine Learning · Statistics 2016-05-25 Roger Grosse, James Martens

Kronecker-factored Approximate Curvature (KFAC) From Scratch

Kronecker-factored approximate curvature (KFAC) is arguably one of the most prominent curvature approximations in deep learning. Its applications range from optimization to Bayesian deep learning, training data attribution with influence…

Machine Learning · Computer Science 2025-07-08 Felix Dangel, Bálint Mucsányi, Tobias Weber, Runa Eschenhagen

Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures

The core components of many modern neural network architectures, such as transformers, convolutional, or graph neural networks, can be expressed as linear layers with weight-sharing. Kronecker-Factored Approximate Curvature…

Machine Learning · Computer Science 2024-01-12 Runa Eschenhagen, Alexander Immer, Richard E. Turner, Frank Schneider, Philipp Hennig

Two-Level K-FAC Preconditioning for Deep Learning

In the context of deep learning, many optimization methods use gradient covariance information in order to accelerate the convergence of Stochastic Gradient Descent. In particular, starting with Adagrad, a seemingly endless line of research…

Machine Learning · Computer Science 2020-12-08 Nikolaos Tselepidis, Jonas Kohler, Antonio Orvieto

Eigenvalue-corrected Natural Gradient Based on a New Approximation

Using second-order optimization methods for training deep neural networks (DNNs) has attracted many researchers. A recently proposed method, Eigenvalue-corrected Kronecker Factorization (EKFAC) (George et al., 2018), proposes an…

Machine Learning · Computer Science 2020-11-30 Kai-Xin Gao, Xiao-Lei Liu, Zheng-Hai Huang, Min Wang, Shuangling Wang, Zidong Wang, Dachuan Xu, Fan Yu

Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization

Second-order optimizers are thought to hold the potential to speed up neural network training, but due to the enormous size of the curvature matrix, they typically require approximations to be computationally tractable. The most successful…

Machine Learning · Computer Science 2022-06-13 Frederik Benzing

A Coordinate-Free Construction of Scalable Natural Gradient

Most neural networks are trained using first-order optimization methods, which are sensitive to the parameterization of the model. Natural gradient descent is invariant to smooth reparameterizations because it is defined in a…

Machine Learning · Computer Science 2018-08-31 Kevin Luk, Roger Grosse

Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks

Distributed training with synchronous stochastic gradient descent (SGD) on GPU clusters has been widely used to accelerate the training process of deep models. However, SGD only utilizes the first-order gradient in model parameter updates,…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-07-15 Shaohuai Shi, Lin Zhang, Bo Li

Randomized K-FACs: Speeding up K-FAC with Randomized Numerical Linear Algebra

K-FAC is a successful tractable implementation of Natural Gradient for Deep Learning, which nevertheless suffers from the requirement to compute the inverse of the Kronecker factors (through an eigen-decomposition). This can be very…

Machine Learning · Computer Science 2022-11-28 Constantin Octavian Puiu

Brand New K-FACs: Speeding up K-FAC with Online Decomposition Updates

K-FAC (arXiv:1503.05671, arXiv:1602.01407) is a tractable implementation of Natural Gradient (NG) for Deep Learning (DL), whose bottleneck is computing the inverses of the so-called "Kronecker-Factors" (K-factors). RS-KFAC…

Machine Learning · Computer Science 2023-09-13 Constantin Octavian Puiu

Fast Approximate Natural Gradient Descent in a Kronecker-factored Eigenbasis

Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the…

Machine Learning · Computer Science 2021-07-27 Thomas George, César Laurent, Xavier Bouthillier, Nicolas Ballas, Pascal Vincent

Scalable Thermodynamic Second-order Optimization

Many hardware proposals have aimed to accelerate inference in AI workloads. Less attention has been paid to hardware acceleration of training, despite the enormous societal impact of rapid training of AI models. Physics-based computers,…

Emerging Technologies · Computer Science 2025-02-13 Kaelan Donatella, Samuel Duffield, Denis Melanson, Maxwell Aifer, Phoebe Klett, Rajath Salegame, Zach Belateche, Gavin Crooks, Antonio J. Martinez, Patrick J. Coles

Scalable K-FAC Training for Deep Neural Networks with Distributed Preconditioning

The second-order optimization methods, notably the D-KFAC (Distributed Kronecker Factored Approximate Curvature) algorithms, have gained traction on accelerating deep neural network (DNN) training on GPU clusters. However, existing D-KFAC…

Machine Learning · Computer Science 2022-07-01 Lin Zhang, Shaohuai Shi, Wei Wang, Bo Li

Inefficiency of K-FAC for Large Batch Size Training

In stochastic optimization, using large batch sizes during training can leverage parallel resources to produce faster wall-clock training times per training epoch. However, for both training loss and testing error, recent results analyzing…

Machine Learning · Computer Science 2021-04-21 Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, Michael W. Mahoney