Related papers: Shampoo: Preconditioned Stochastic Tensor Optimiza…

A New Perspective on Shampoo's Preconditioner

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an…

Machine Learning · Computer Science 2024-06-26 Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying…

Machine Learning · Computer Science 2026-02-03 Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh

Structured Preconditioners in Adaptive Optimization: A Unified Analysis

We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex…

Machine Learning · Computer Science 2025-07-16 Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory

Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent…

Machine Learning · Computer Science 2026-02-11 Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, Hao-Jun Michael Shi

SOAP: Improving and Stabilizing Shampoo using Adam

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when…

Machine Learning · Computer Science 2025-02-03 Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

Memory-Efficient 4-bit Preconditioned Stochastic Optimization

Preconditioned stochastic optimization algorithms, exemplified by Shampoo, outperform first-order optimizers by offering theoretical convergence benefits and practical gains in large-scale neural network training. However, they incur…

Machine Learning · Computer Science 2025-03-13 Jingyang Li, Kuangyu Ding, Kim-Chuan Toh, Pan Zhou

A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale

Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product…

Machine Learning · Computer Science 2023-09-14 Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, Michael Rabbat

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as…

Machine Learning · Computer Science 2026-05-04 S. Gratton, Ph. L. Toint

Understanding SOAP from the Perspective of Gradient Whitening

Shampoo with Adam in the Preconditioner's eigenbasis (SOAP) has recently emerged as a promising optimization algorithm for neural network training, achieving superior training efficiency over both Adam and Shampoo in language modeling…

Machine Learning · Computer Science 2025-09-30 Yanqing Lu, Letao Wang, Jinbo Liu

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such…

Machine Learning · Computer Science 2025-10-30 Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

4-bit Shampoo for Memory-Efficient Network Training

Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models…

Machine Learning · Computer Science 2025-01-13 Sike Wang, Pan Zhou, Jia Li, Hua Huang

Hyperparameter Transfer Enables Consistent Gains of Matrix-Preconditioned Optimizers Across Scales

Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts…

Machine Learning · Computer Science 2026-01-21 Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization

Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a…

Machine Learning · Statistics 2023-06-01 Jonathan Mei, Alexander Moreno, Luke Walters

A Stable Whitening Optimizer for Efficient Neural Network Training

In this work, we take an experimentally grounded look at neural network optimization. Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method. First, we find that…

Machine Learning · Computer Science 2025-10-27 Kevin Frans, Sergey Levine, Pieter Abbeel

PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective

The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training…

Optimization and Control · Mathematics 2026-02-06 Tim Tsz-Kit Lau, Qi Long, Weijie Su

Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization

We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds…

Optimization and Control · Mathematics 2026-02-10 Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari

Optimal Projection-Free Adaptive SGD for Matrix Optimization

Recently, Jiang et al. [2026] developed Leon, a practical variant of One-sided Shampoo [Xie et al., 2025a, An et al., 2025] algorithm for online convex optimization, which does not require computing a costly quadratic projection at each…

Optimization and Control · Mathematics 2026-04-06 Dmitry Kovalev

Improving Adaptive Moment Optimization via Preconditioner Diagonalization

Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based…

Machine Learning · Computer Science 2025-02-12 Son Nguyen, Bo Liu, Lizhang Chen, Qiang Liu

SGD with Adaptive Preconditioning: Unified Analysis and Momentum Acceleration

In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix…

Machine Learning · Computer Science 2025-07-01 Dmitry Kovalev

Adam: A Method for Stochastic Optimization

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma, Jimmy Ba