Related papers: Decoupled Orthogonal Dynamics: Regularization for …

200 papers

Adaptive optimizers with decoupled weight decay, such as AdamW, are the de facto standard for pre-training large transformer-based generative models. Yet the quadratic nature of the $\ell_2$ penalty embedded in weight decay drives all…

Machine Learning · Computer Science 2025-11-19 Fu-Ming Guo, Yingfang Fan

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as…

Machine Learning · Computer Science 2019-01-08 Ilya Loshchilov, Frank Hutter

In this paper, we investigate the convergence properties of a wide class of Adam-family methods for minimizing quadratically regularized nonsmooth nonconvex optimization problems, especially in the context of training nonsmooth neural…

Optimization and Control · Mathematics 2023-10-16 Kuangyu Ding, Nachuan Xiao, Kim-Chuan Toh

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set proportional to the learning rate $\gamma$ without question. Some researchers have recently challenged this assumption and…

Machine Learning · Computer Science 2026-04-15 Jason Chuan-Chih Chou

Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that WD only matters at the start of the training…

Machine Learning · Computer Science 2020-12-29 Johan Bjorck, Kilian Weinberger, Carla Gomes

Offline reinforcement learning (RL) can fail spectacularly when bootstrapped temporal-difference (TD) updates amplify their own errors, driving the critic toward extreme and unusable Q-values. A key counterintuitive insight of this work is…

Machine Learning · Computer Science 2026-05-05 Nan Qiao, Sheng Yue, Shuning Wang, Ju Ren

Efficient stochastic optimization typically integrates an update direction that performs well in the deterministic regime with a mechanism adapting to stochastic perturbations. While Adam uses adaptive moment estimates to promote stability,…

Machine Learning · Computer Science 2026-02-23 Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Adam has been widely adopted for training deep neural networks due to less hyperparameter tuning and remarkable performance. To improve generalization, Adam is typically used in tandem with a squared $\ell_2$ regularizer (referred to as…

Machine Learning · Computer Science 2022-02-02 Zhenxun Zhuang, Mingrui Liu, Ashok Cutkosky, Francesco Orabona

We note that decoupled weight decay regularization is a particular case of weight norm control where the target norm of weights is set to 0. Any optimization method (e.g., Adam) which uses decoupled weight decay regularization…

Machine Learning · Computer Science 2023-11-22 Ilya Loshchilov

Adaptive gradient methods such as Adam have gained increasing popularity in deep learning optimization. However, it has been observed that compared with (stochastic) gradient descent, Adam can converge to a different solution with a…

Machine Learning · Computer Science 2021-08-26 Difan Zou, Yuan Cao, Yuanzhi Li, Quanquan Gu

AdamW has become one of the most effective optimizers for training large-scale models. We have also observed its effectiveness in the context of federated learning (FL). However, directly applying AdamW in federated learning settings poses…

Machine Learning · Computer Science 2026-04-21 Junkang Liu, Fanhua Shang, Hongying Liu, Yuxuan Tian, Yuanyuan Liu, Jin Liu, Kewen Zhu, Zhouchen Lin

We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos…

Machine Learning · Computer Science 2022-11-22 Ran Tian, Ankur P. Parikh

Adam with decoupled weight decay, also known as AdamW, is widely acclaimed for its superior performance in language modeling tasks, surpassing Adam with $\ell_2$ regularization in terms of generalization and optimization. However, this…

Machine Learning · Computer Science 2024-04-09 Shuo Xie, Zhiyuan Li

In this paper, we introduce weight prediction into the AdamW optimizer to boost its convergence when training the deep neural network (DNN) models. In particular, ahead of each mini-batch training, we predict the future weights according to…

Machine Learning · Computer Science 2023-08-09 Lei Guan

How does the choice of optimization algorithm shape a model's ability to learn features? To address this question for steepest descent methods (including sign descent, which is closely related to Adam), we introduce steepest mirror flows…

Machine Learning · Computer Science 2026-03-03 Tom Jacobs, Chao Zhou, Rebekka Burkholz

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly with often better generalization performances. It has been argued that the normalization-induced scale invariance among the weights…

Machine Learning · Computer Science 2021-01-19 Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha

The recently proposed Muon optimizer updates weight matrices via orthogonalized momentum and has demonstrated strong empirical success in large language model training. However, it remains unclear how to determine the learning rates for…

Machine Learning · Computer Science 2025-09-09 Minxin Zhang, Yuxuan Liu, Hayden Schaeffer

Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental components of modern deep learning. However, their training dynamics lack comprehensive theoretical understanding, with limited insight into why…

Machine Learning · Computer Science 2024-12-23 Rhys Gould, Hidenori Tanaka

We propose NovoGrad, an adaptive stochastic gradient descent method with layer-wise gradient normalization and decoupled weight decay. In our experiments on neural networks for image classification, speech recognition, machine translation,…

We identify and formalize an underexplored phenomenon in deep learning optimization: directional alignment and loss convergence can be decoupled. An optimizer can exhibit near-perfect directional consistency (cc_t -> 1, measured via…

Machine Learning · Computer Science 2026-05-08 Victor Daniel Gera