Related papers: L2 Regularization versus Batch and Weight Normaliz…

Norm matters: efficient and accurate normalization schemes in deep networks

Over the past few years, Batch-Normalization has been commonly used in deep networks, allowing faster training and high performance for a wide variety of applications. However, the reasons behind its merits remained unanswered, with several…

Machine Learning · Statistics 2019-02-08 Elad Hoffer, Ron Banner, Itay Golan, Daniel Soudry

Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks

L2 regularization for weights in neural networks is widely used as a standard training trick. However, L2 regularization for gamma, a trainable parameter of batch normalization, remains an undiscussed mystery and is applied in different…

Computer Vision and Pattern Recognition · Computer Science 2022-05-17 Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

Three Mechanisms of Weight Decay Regularization

Weight decay is one of the standard tricks in the neural network toolbox, but the reasons for its regularization effect are poorly understood, and recent results have cast doubt on the traditional interpretation in terms of $L_2$…

Machine Learning · Computer Science 2018-10-30 Guodong Zhang, Chaoqi Wang, Bowen Xu, Roger Grosse

Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization

Weight decay is often used to ensure good generalization in the training practice of deep neural networks with batch normalization (BN-DNNs), where some convolution layers are invariant to weight rescaling due to the normalization. In this…

Machine Learning · Computer Science 2022-06-22 Ziquan Liu, Yufei Cui, Jia Wan, Yu Mao, Antoni B. Chan

Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization

Deep convolutional neural networks are known to be unstable during training at high learning rate unless normalization techniques are employed. Normalizing weights or activations allows the use of higher learning rates, resulting in faster…

Machine Learning · Computer Science 2019-12-02 Brendan Ruff, Taylor Beck, Joscha Bach

Weight decay induces low-rank attention layers

The effect of regularizers such as weight decay when training deep neural networks is not well understood. We study the influence of weight decay as well as $L2$-regularization when training neural network models in which parameter matrices…

Machine Learning · Computer Science 2024-11-01 Seijin Kobayashi, Yassir Akram, Johannes Von Oswald

Combining learning rate decay and weight decay with complexity gradient descent - Part I

The role of $L^2$ regularization, in the specific case of deep neural networks rather than more traditional machine learning models, is still not fully elucidated. We hypothesize that this complex interplay is due to the combination of…

Machine Learning · Computer Science 2019-02-11 Pierre H. Richemond, Yike Guo

On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay

Training neural networks with batch normalization and weight decay has become a common practice in recent years. In this work, we show that their combined use may result in a surprising periodic behavior of optimization dynamics: the…

Machine Learning · Computer Science 2022-01-19 Ekaterina Lobacheva, Maxim Kodryan, Nadezhda Chirkova, Andrey Malinin, Dmitry Vetrov

Optimization Theory for ReLU Neural Networks Trained with Normalization Layers

The success of deep neural networks is in part due to the use of normalization layers. Normalization layers like Batch Normalization, Layer Normalization and Weight Normalization are ubiquitous in practice, as they improve generalization…

Machine Learning · Computer Science 2020-06-15 Yonatan Dukler, Quanquan Gu, Guido Montúfar

Weight and Gradient Centralization in Deep Neural Networks

Batch normalization is currently the most widely used variant of internal normalization for deep neural networks. Additional work has shown that the normalization of weights and additional conditioning as well as the normalization of…

Computer Vision and Pattern Recognition · Computer Science 2021-01-19 Wolfgang Fuhl, Enkelejda Kasneci

Low-rank bias, weight decay, and model merging in neural networks

We explore the low-rank structure of the weight matrices in neural networks at the stationary points (limiting solutions of optimization algorithms) with $L2$ regularization (also known as weight decay). We show several properties of such…

Machine Learning · Computer Science 2025-08-21 Ilja Kuzborskij, Yasin Abbasi Yadkori

Training Deep Neural Networks Without Batch Normalization

Training neural networks is an optimization problem, and finding a decent set of parameters through gradient descent can be a difficult task. A host of techniques has been developed to aid this process before and during the training phase.…

Machine Learning · Computer Science 2020-08-19 Divya Gaur, Joachim Folz, Andreas Dengel

Layer Normalization

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the…

Machine Learning · Statistics 2016-07-22 Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

The Effects of Regularization and Data Augmentation are Class Dependent

Regularization is a fundamental technique to prevent over-fitting and to improve generalization performances by constraining a model's complexity. Current Deep Networks heavily rely on regularizers such as Data-Augmentation (DA) or…

Machine Learning · Computer Science 2022-04-12 Randall Balestriero, Leon Bottou, Yann LeCun

Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks

This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular…

Machine Learning · Computer Science 2024-06-04 Atli Kosson, Bettina Messmer, Martin Jaggi

Why Do We Need Weight Decay in Modern Deep Learning?

Weight decay is a broadly used technique for training state-of-the-art deep networks from image classification to large language models. Despite its widespread usage and being extensively studied in the classical literature, its role…

Machine Learning · Computer Science 2024-11-06 Francesco D'Angelo, Maksym Andriushchenko, Aditya Varre, Nicolas Flammarion

Comparison of Batch Normalization and Weight Normalization Algorithms for the Large-scale Image Classification

Batch normalization (BN) has become a de facto standard for training deep convolutional networks. However, BN accounts for a significant fraction of training run-time and is difficult to accelerate, since it is a memory-bandwidth bounded…

Computer Vision and Pattern Recognition · Computer Science 2017-10-10 Igor Gitman, Boris Ginsburg

Batch Normalization is a Cause of Adversarial Vulnerability

Batch normalization (batch norm) is often used in an attempt to stabilize and accelerate training in deep neural networks. In many cases it indeed decreases the number of parameter updates required to achieve low training error. However, it…

Machine Learning · Computer Science 2019-05-31 Angus Galloway, Anna Golubeva, Thomas Tanay, Medhat Moussa, Graham W. Taylor

Time Matters in Regularizing Deep Networks: Weight Decay and Data Augmentation Affect Early Learning Dynamics, Matter Little Near Convergence

Regularization is typically understood as improving generalization by altering the landscape of local extrema to which the model eventually converges. Deep neural networks (DNNs), however, challenge this view: We show that removing…

Machine Learning · Computer Science 2019-06-03 Aditya Golatkar, Alessandro Achille, Stefano Soatto

Volumization as a Natural Generalization of Weight Decay

We propose a novel regularization method, called volumization, for neural networks. Inspired by physics, we define a physical volume for the weight parameters in neural networks, and we show that this method is an effective way of…

Machine Learning · Computer Science 2020-04-02 Liu Ziyin, Zihao Wang, Makoto Yamada, Masahito Ueda