Related papers: Decoupled Orthogonal Dynamics: Regularization for …

AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training

Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all…

Machine Learning · Computer Science 2025-11-19 Fu-Ming Guo, Yingfang Fan

Decoupled Weight Decay Regularization

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as…

Machine Learning · Computer Science 2019-01-08 Ilya Loshchilov, Frank Hutter

Adam-family Methods with Decoupled Weight Decay in Deep Learning

In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural…

Optimization and Control · Mathematics 2023-10-16 Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh

Correction of Decoupled Weight Decay

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate $\gamma$ without question. Some researchers have recently challenged this assumption and…

Machine Learning · Computer Science 2026-04-15 Jason Chuan-Chih Chou

Understanding Decoupled and Early Weight Decay

Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of the training…

Machine Learning · Computer Science 2020-12-29 Johan Bjorck, Kilian Weinberger, Carla Gomes

AdamO: A Collapse-Suppressed Optimizer for Offline RL

Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is…

Machine Learning · Computer Science 2026-05-05 Nan Qiao, Sheng Yue, Shuning Wang, Ju Ren

Adam Improves Muon: Adaptive Moment Estimation with Orthogonalized Momentum

Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability,…

Machine Learning · Computer Science 2026-02-23 Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Understanding AdamW through Proximal Methods and Scale-Freeness

Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as…

Machine Learning · Computer Science 2022-02-02 Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona

Weight Norm Control

We note that decoupled weight decay regularization is a particular case of weight norm control where the target norm of weights is set to 0. Any optimization method (e.g., Adam) which uses decoupled weight decay regularization…

Machine Learning · Computer Science 2023-11-22 Ilya Loshchilov

Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a…

Machine Learning · Computer Science 2021-08-26 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models

AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses…

Machine Learning · Computer Science 2026-04-21 Junkang Liu, Fanhua Shang, Hongying Liu, Yuxuan Tian, Yuanyuan Liu, Jin Liu, Kewen Zhu, Zhouchen Lin

Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos…

Machine Learning · Computer Science 2022-11-22 Ran Tian, Ankur P. Parikh

Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization

Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this…

Machine Learning · Computer Science 2024-04-09 Shuo Xie, Zhiyuan Li

Weight Prediction Boosts the Convergence of AdamW

In this paper, we introduce weight prediction into the AdamW optimizer to boost its convergence when training deep neural network (DNN) models. In particular, ahead of each mini-batch training, we predict the future weights according to…

Machine Learning · Computer Science 2023-08-09 Lei Guan

Never Saddle for Reparameterized Steepest Descent as Mirror Flow

How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods (including sign descent, which is closely related to Adam), we introduce steepest mirror flows…

Machine Learning · Computer Science 2026-03-03 Tom Jacobs, Chao Zhou, Rebekka Burkholz

AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights…

Machine Learning · Computer Science 2021-01-19 Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha

AdaGrad Meets Muon: Adaptive Stepsizes for Orthogonal Updates

The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for…

Machine Learning · Computer Science 2025-09-09 Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Continuous-Time Analysis of Adaptive Optimization and Normalization

Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental components of modern deep learning. However, their training dynamics lack comprehensive theoretical understanding, with limited insight into why…

Machine Learning · Computer Science 2024-12-23 Rhys Gould, Hidenori Tanaka

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation,…

Machine Learning · Computer Science 2020-02-10 Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

Directional Consistency as a Complementary Optimization Signal: The GONO Framework

We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency ($\mathrm{cc}_t \to 1$, measured via…

Machine Learning · Computer Science 2026-05-08 Victor Daniel Gera