Related papers: Decoupled Weight Decay Regularization

Understanding Decoupled and Early Weight Decay

Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of the training…

Machine Learning · Computer Science 2020-12-29 Johan Bjorck, Kilian Weinberger, Carla Gomes

AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training

Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all…

Machine Learning · Computer Science 2025-11-19 Fu-Ming Guo, Yingfang Fan

Understanding the Generalization of Adam in Learning Neural Networks with Proper Regularization

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a…

Machine Learning · Computer Science 2021-08-26 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

Adam-family Methods with Decoupled Weight Decay in Deep Learning

In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural…

Optimization and Control · Mathematics 2023-10-16 Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh

Decoupled Orthogonal Dynamics: Regularization for Deep Network Optimizers

Is the standard weight decay in AdamW truly optimal? Although AdamW decouples weight decay from adaptive gradient scaling, a fundamental conflict remains: the Radial Tug-of-War. In deep learning, gradients tend to increase parameter norms…

Machine Learning · Computer Science 2026-02-06 Hao Chen, Jinghui Yuan, Hanmin Zhang

Implicit Bias of AdamW: $\ell_\infty$ Norm Constrained Optimization

Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this…

Machine Learning · Computer Science 2024-04-09 Shuo Xie, Zhiyuan Li

Adaptive Weight Decay for Deep Neural Networks

Regularization in the optimization of deep neural networks is often critical to avoid undesirable over-fitting leading to better generalization of model. One of the most popular regularization algorithms is to impose L-2 penalty on the…

Machine Learning · Computer Science 2019-08-09 Kensuke Nakamura, Byung-Woo Hong

Correction of Decoupled Weight Decay

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set to proportional to learning rate $\gamma$ without questioning. Some researchers have recently challenged such assumption and…

Machine Learning · Computer Science 2026-04-15 Jason Chuan-Chih Chou

Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon-$shifted $L_2$ Regularizer

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight…

Machine Learning · Computer Science 2019-11-15 Li Xiang, Chen Shuo, Xia Yan, Yang Jian

Three Mechanisms of Weight Decay Regularization

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$…

Machine Learning · Computer Science 2018-10-30 Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Weight decay is often used to ensure good generalization in the training practice of deep neural networks with batch normalization (BN-DNNs), where some convolution layers are invariant to weight rescaling due to the normalization. In this…

Machine Learning · Computer Science 2022-06-22 Ziquan Liu, Yufei Cui, Jia Wan, Yu Mao, Antoni B. Chan

PathProx: A Proximal Gradient Algorithm for Weight Decay Regularized Deep Neural Networks

Weight decay is one of the most widely used forms of regularization in deep learning, and has been shown to improve generalization and robustness. The optimization objective driving weight decay is a sum of losses plus a term proportional…

Machine Learning · Computer Science 2023-07-07 Liu Yang, Jifan Zhang, Joseph Shenouda, Dimitris Papailiopoulos, Kangwook Lee, Robert D. Nowak

A New Adaptive Gradient Method with Gradient Decomposition

Adaptive gradient methods, especially Adam-type methods (such as Adam, AMSGrad, and AdaBound), have been proposed to speed up the training process with an element-wise scaling term on learning rates. However, they often generalize poorly…

Machine Learning · Computer Science 2021-07-20 Zhou Shao, Tong Lin

Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation,…

Machine Learning · Computer Science 2020-02-10 Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, Jonathan M. Cohen

Weight Norm Control

We note that decoupled weight decay regularization is a particular case of weight norm control where the target norm of weights is set to 0. Any optimization method (e.g., Adam) which uses decoupled weight decay regularization…

Machine Learning · Computer Science 2023-11-22 Ilya Loshchilov

Asymmetric Momentum: A Rethinking of Gradient Descent

Through theoretical and experimental validation, unlike all existing adaptive methods like Adam which penalize frequently-changing parameters and are only applicable to sparse gradients, we propose the simplest SGD enhanced method,…

Machine Learning · Computer Science 2023-10-04 Gongyue Zhang, Dinghuang Zhang, Shuwen Zhao, Donghan Liu, Carrie M. Toptan, Honghai Liu

Understanding the Generalization of Stochastic Gradient Adam in Learning Neural Networks

Adam is a popular and widely used adaptive gradient method in deep learning, which has also received tremendous focus in theoretical research. However, most existing theoretical work primarily analyzes its full-batch version, which differs…

Machine Learning · Computer Science 2025-10-14 Xuan Tang, Han Zhang, Yuan Cao, Difan Zou

Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models

Modern optimizers such as AdamW, equipped with momentum and adaptive learning rate, are designed to escape local minima and explore the vast parameter space. This exploration is beneficial for finding good loss basins when training from…

Machine Learning · Computer Science 2024-11-05 Junjiao Tian, Chengyue Huang, Zsolt Kira

Exploiting Adam-like Optimization Algorithms to Improve the Performance of Convolutional Neural Networks

Stochastic gradient descent (SGD) is the main approach for training deep networks: it moves towards the optimum of the cost function by iteratively updating the parameters of a model in the direction of the gradient of the loss evaluated on…

Machine Learning · Computer Science 2021-03-30 Loris Nanni, Gianluca Maguolo, Alessandra Lumini

Combining learning rate decay and weight decay with complexity gradient descent - Part I

The role of $L^2$ regularization, in the specific case of deep neural networks rather than more traditional machine learning models, is still not fully elucidated. We hypothesize that this complex interplay is due to the combination of…

Machine Learning · Computer Science 2019-02-11 Pierre H. Richemond, Yike Guo