
Related papers: Memory-Efficient 4-bit Preconditioned Stochastic O…

200 papers

Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models…

Machine Learning · Computer Science 2025-01-13 Sike Wang, Pan Zhou, Jia Li, Hua Huang

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware…

Machine Learning · Computer Science 2018-03-05 Vineet Gupta, Tomer Koren, Yoram Singer

Shampoo is one of the leading approximate second-order optimizers: a variant of it has won the MLCommons AlgoPerf competition, and it has been shown to produce models with lower activation outliers that are easier to compress. Yet, applying…

Machine Learning · Computer Science 2026-02-03 Ionut-Vlad Modoranu, Philip Zmushko, Erik Schultheis, Mher Safaryan, Dan Alistarh

Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent…

Machine Learning · Computer Science 2026-02-11 Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, Hao-Jun Michael Shi

Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature, but have traditionally required too much memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018] introduced a…

Machine Learning · Statistics 2023-06-01 Jonathan Mei, Alexander Moreno, Luke Walters

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an…

Machine Learning · Computer Science 2024-06-26 Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and Kronecker-factored) preconditioners for both online regret minimization and offline convex…

Machine Learning · Computer Science 2025-07-16 Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, Zhiyuan Li

Large Language Models (LLMs) are often quantized to lower precision to reduce the memory cost and latency in inference. However, quantization often degrades model performance, thus fine-tuning is required for various down-stream tasks.…

Machine Learning · Computer Science 2025-02-19 Jiajun Zhou, Yifan Yang, Kai Zhen, Ziyue Liu, Yequan Zhao, Ershad Banijamali, Athanasios Mouchtaris, Ngai Wong, Zheng Zhang

The recent success of Shampoo in the AlgoPerf contest has sparked renewed interest in Kronecker-factorization-based optimization algorithms for training neural networks. Despite its success, Shampoo relies heavily on several heuristics such…

Machine Learning · Computer Science 2025-10-30 Runa Eschenhagen, Aaron Defazio, Tsung-Hsien Lee, Richard E. Turner, Hao-Jun Michael Shi

There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when…

Machine Learning · Computer Science 2025-02-03 Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, Sham Kakade

The modified Cholesky decomposition is commonly used for precision matrix estimation given a specified order of random variables. However, the order of variables is often not available or cannot be pre-determined. In this work, we propose…

Machine Learning · Statistics 2021-11-23 Xiaoning Kang, Xinwei Deng

Shampoo-based methods, such as KL-Shampoo and SOAP, have demonstrated strong performance in training neural networks and rely on QR decomposition. Because existing QR implementations require single-precision (FP32) arithmetic and remain…

Machine Learning · Computer Science 2026-05-27 Alan Milligan, Zikun Xu, Simon Lacoste-Julien, Felix Dangel, Wu Lin

Second-order methods hold significant promise for enhancing the convergence of deep neural network training; however, their large memory and computational demands have limited their practicality. Thus there is a need for scalable…

Machine Learning · Computer Science 2023-11-17 Fnu Devvrit, Sai Surya Duvvuri, Rohan Anil, Vineet Gupta, Cho-Jui Hsieh, Inderjit Dhillon

Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques,…

Computation and Language · Computer Science 2024-06-07 Renren Jin, Jiangcun Du, Wuwei Huang, Wei Liu, Jian Luan, Bin Wang, Deyi Xiong

As new optimizers gain traction and model quantization becomes standard for efficient deployment, a key question arises: how does the choice of optimizer affect model performance in the presence of quantization? Despite progress in both…

Machine Learning · Computer Science 2025-10-03 Georgios Vlassis, Saleh Ashkboos, Alexandra Volkova, Torsten Hoefler, Dan Alistarh

During recent years there has been an increased interest in stochastic adaptations of limited memory quasi-Newton methods, which compared to pure gradient-based routines can improve the convergence by incorporating second order information.…

Optimization and Control · Mathematics 2018-10-03 Adrian Wills, Carl Jidling, Thomas Schon

Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but often incur significant computation and memory overheads. This can result in lower training efficiency than the first-order…

Machine Learning · Computer Science 2023-08-07 Lin Zhang, Shaohuai Shi, Bo Li

Modern deep learning heavily depends on adaptive optimizers such as Adam and its variants, which are renowned for their capacity to handle model scaling and streamline hyperparameter tuning. However, these algorithms typically experience…

Machine Learning · Computer Science 2024-10-18 Son Nguyen, Lizhang Chen, Bo Liu, Qiang Liu

The solution of a sparse system of linear equations is ubiquitous in scientific applications. Iterative methods, such as the Preconditioned Conjugate Gradient method (PCG), are normally chosen over direct methods due to memory and…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-04 Joshua Dennis Booth, Hongyang Sun, Trevor Garnett

By enabling multiple agents to cooperatively solve a global optimization problem in the absence of a central coordinator, decentralized stochastic optimization is gaining increasing attention in areas as diverse as machine learning,…

Optimization and Control · Mathematics 2022-08-10 Yongqiang Wang, Tamer Basar