Related papers: Memory-Efficient 4-bit Preconditioned Stochastic O…

4-bit Shampoo for Memory-Efficient Network Training

Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models…

Machine Learning · Computer Science 2025-01-13 Sike Wang, Pan Zhou, Jia Li, Hua Huang

Shampoo: Preconditioned Stochastic Tensor Optimization

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware…

Machine Learning · Computer Science 2018-03-05 Vineet Gupta, Tomer Koren, Yoram Singer

DASH: Faster Shampoo via Batched Block Preconditioning and Efficient Inverse-Root Solvers

Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying…

Machine Learning · Computer Science 2026-02-03 Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh

Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory

Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent…

Machine Learning · Computer Science 2026-02-11 Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, Hao-Jun Michael Shi

KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned Stochastic Optimization

Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a…

Machine Learning · Statistics 2023-06-01 Jonathan Mei, Alexander Moreno, Luke Walters

A New Perspective on Shampoo's Preconditioner

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an…

Machine Learning · Computer Science 2024-06-26 Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

Structured Preconditioners in Adaptive Optimization: A Unified Analysis

We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and kronecker-factored) preconditioners for both online regret minimization and offline convex…

Machine Learning · Computer Science 2025-07-16 Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

QuZO: Quantized Zeroth-Order Fine-Tuning for Large Language Models

Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various down-stream tasks.…

Machine Learning · Computer Science 2025-02-19 Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

Purifying Shampoo: Investigating Shampoo's Heuristics by Decomposing its Preconditioner

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such…

Machine Learning · Computer Science 2025-10-30 Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

SOAP: Improving and Stabilizing Shampoo using Adam

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when…

Machine Learning · Computer Science 2025-02-03 Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

An Improved Modified Cholesky Decomposition Method for Precision Matrix Estimation

The modified Cholesky decomposition is commonly used for precision matrix estimation given a specified order of random variables. However, the order of variables is often not available or cannot be pre-determined. In this work, we propose…

Machine Learning · Statistics 2021-11-23 Xiaoning Kang, Xinwei Deng

Reparametrizing Shampoo and SOAP for Subspace Basis Updates and BFloat16 Storage

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain…

Machine Learning · Computer Science 2026-05-27 Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin

A Computationally Efficient Sparsified Online Newton Method

Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable…

Machine Learning · Computer Science 2023-11-17 Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon

A Comprehensive Evaluation of Quantization Strategies for Large Language Models

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science 2024-06-07 Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

Beyond Outliers: A Study of Optimizers Under Quantization

As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both…

Machine Learning · Computer Science 2025-10-03 Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh

A fast quasi-Newton-type method for large-scale stochastic optimisation

During recent years there has been an increased interest in stochastic adaptations of limited memory quasi-Newton methods, which compared to pure gradient-based routines can improve the convergence by incorporating second order information.…

Optimization and Control · Mathematics 2018-10-03 Adrian Wills, Carl Jidling, Thomas Schon

Eva: A General Vectorized Approximation Framework for Second-order Optimization

Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads. This can result in lower training efficiency than the first-order…

Machine Learning · Computer Science 2023-08-07 Lin Zhang, Shaohuai Shi, Bo Li

Memory-Efficient Optimization with Factorized Hamiltonian Descent

Modern deep learning heavily depends on adaptive optimizers such as Adam and its variants, which are renowned for their capacity to handle model scaling and streamline hyperparameter tuning. However, these algorithms typically experience…

Machine Learning · Computer Science 2024-10-18 Son Nguyen, Lizhang Chen, Bo Liu, Qiang Liu

Neural Acceleration of Incomplete Cholesky Preconditioners

The solution of a sparse system of linear equations is ubiquitous in scientific applications. Iterative methods, such as the Preconditioned Conjugate Gradient method (PCG), are normally chosen over direct methods due to memory and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-04 Joshua Dennis Booth, Hongyang Sun, Trevor Garnett

Quantization enabled Privacy Protection in Decentralized Stochastic Optimization

By enabling multiple agents to cooperatively solve a global optimization problem in the absence of a central coordinator, decentralized stochastic optimization is gaining increasing attention in areas as diverse as machine learning,…

Optimization and Control · Mathematics 2022-08-10 Yongqiang Wang, Tamer Basar