Related papers: Shampoo: Preconditioned Stochastic Tensor Optimiza…

200 papers

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an…

Machine Learning · Computer Science 2024-06-26 Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying…

Machine Learning · Computer Science 2026-02-03 Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh

We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex…

Machine Learning · Computer Science 2025-07-16 Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent…

Machine Learning · Computer Science 2026-02-11 Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, Hao-Jun Michael Shi

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when…

Machine Learning · Computer Science 2025-02-03 Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

Preconditioned stochastic optimization algorithms, exemplified by Shampoo, outperform first-order optimizers by offering theoretical convergence benefits and practical gains in large-scale neural network training. However, they incur…

Machine Learning · Computer Science 2025-03-13 Jingyang Li, Kuangyu Ding, Kim-Chuan Toh, Pan Zhou

Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product…

A unified framework for first-order optimization algorithms for nonconvex unconstrained optimization is proposed that uses adaptively preconditioned gradients and includes popular methods such as full and diagonal AdaGrad, AdaNorm, as well as…

Machine Learning · Computer Science 2026-05-04 S. Gratton, Ph. L. Toint

Shampoo with Adam in the Preconditioner's eigenbasis (SOAP) has recently emerged as a promising optimization algorithm for neural network training, achieving superior training efficiency over both Adam and Shampoo in language modeling…

Machine Learning · Computer Science 2025-09-30 Yanqing Lu, Letao Wang, Jinbo Liu

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such…

Machine Learning · Computer Science 2025-10-30 Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models…

Machine Learning · Computer Science 2025-01-13 Sike Wang, Pan Zhou, Jia Li, Hua Huang

Several recently introduced deep learning optimizers utilizing matrix-level preconditioning have shown promising speedups relative to the current dominant optimizer AdamW, particularly in relatively small-scale experiments. However, efforts…

Machine Learning · Computer Science 2026-01-21 Shikai Qiu, Zixi Chen, Hoang Phan, Qi Lei, Andrew Gordon Wilson

Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a…

Machine Learning · Statistics 2023-06-01 Jonathan Mei, Alexander Moreno, Luke Walters

In this work, we take an experimentally grounded look at neural network optimization. Building on the Shampoo family of algorithms, we identify and alleviate three key issues, resulting in the proposed SPlus method. First, we find that…

Machine Learning · Computer Science 2025-10-27 Kevin Frans, Sergey Levine, Pieter Abbeel

The ever-growing scale of deep learning models and training data underscores the critical importance of efficient optimization methods. While preconditioned gradient methods such as Adam and AdamW are the de facto optimizers for training…

Optimization and Control · Mathematics 2026-02-06 Tim Tsz-Kit Lau, Qi Long, Weijie Su

We study online linear optimization with matrix variables constrained by the operator norm, a setting where the geometry renders designing data-dependent and efficient adaptive algorithms challenging. The best-known adaptive regret bounds…

Optimization and Control · Mathematics 2026-02-10 Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, Aryan Mokhtari

Recently, Jiang et al. [2026] developed Leon, a practical variant of the One-sided Shampoo [Xie et al., 2025a, An et al., 2025] algorithm for online convex optimization, which does not require computing a costly quadratic projection at each…

Optimization and Control · Mathematics 2026-04-06 Dmitry Kovalev

Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based…

Machine Learning · Computer Science 2025-02-12 Son Nguyen, Bo Liu, Lizhang Chen, Qiang Liu

In this paper, we revisit stochastic gradient descent (SGD) with AdaGrad-type preconditioning. Our contributions are twofold. First, we develop a unified convergence analysis of SGD with adaptive preconditioning under anisotropic or matrix…

Machine Learning · Computer Science 2025-07-01 Dmitry Kovalev

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has…

Machine Learning · Computer Science 2017-01-31 Diederik P. Kingma, Jimmy Ba