Related papers: Understanding and Improving Layer Normalization

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may…

Machine Learning · Computer Science 2026-05-15 Yuxin Guo, Yihao Yue, Yunhao Ni, Yizhou Ruan, Jie Luo, Wenjun Wu, Lei Huang

Root Mean Square Layer Normalization

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight…

Machine Learning · Computer Science 2019-10-17 Biao Zhang, Rico Sennrich

Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification

LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao, Chen Jiang, Qiufeng Wang, Anh Nguyen, Xin Guo, Yuan Cheng, Xi Yang

Geometric Interpretation of Layer Normalization and a Comparative Analysis with RMSNorm

This paper presents a novel geometric interpretation of LayerNorm and explores how LayerNorm influences the norm and orientation of hidden vectors in the representation space. With these geometric insights, we prepare the foundation for…

Machine Learning · Computer Science 2025-02-04 Akshat Gupta, Atahan Ozdemir, Gopala Anumanchipalli

On the Expressivity Role of LayerNorm in Transformers' Attention

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the…

Machine Learning · Computer Science 2023-05-12 Shaked Brody, Uri Alon, Eran Yahav

GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks

Deep multitask networks, in which one neural network produces multiple predictive outputs, can offer better speed and performance than their single-task counterparts but are challenging to train properly. We present a gradient normalization…

Computer Vision and Pattern Recognition · Computer Science 2018-07-16 Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, Andrew Rabinovich

How Does Batch Normalization Help Optimization?

Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm's effectiveness are still poorly…

Machine Learning · Statistics 2019-04-16 Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative…

Machine Learning · Computer Science 2021-10-27 Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka

Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation

This paper studies the impact of layer normalization (LayerNorm) on zero-shot translation (ZST). Recent efforts for ZST often utilize the Transformer architecture as the backbone, with LayerNorm at the input of layers (PreNorm) set as the…

Computation and Language · Computer Science 2023-05-17 Zhuoyuan Mao, Raj Dabre, Qianying Liu, Haiyue Song, Chenhui Chu, Sadao Kurohashi

Impact of Layer Norm on Memorization and Generalization in Transformers

Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm…

Machine Learning · Computer Science 2025-11-14 Rishi Singhal, Jung-Eun Kim

Layer Normalization

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the…

Machine Learning · Statistics 2016-07-22 Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and…

Machine Learning · Computer Science 2023-10-27 Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, David Z. Pan

Geometry and Dynamics of LayerNorm

A technical note aiming to offer deeper intuition for the LayerNorm function common in deep neural networks. LayerNorm is defined relative to a distinguished 'neural' basis, but it does more than just normalize the corresponding vector…

Machine Learning · Computer Science 2024-05-08 Paul M. Riechers

Efficient Multi-Domain Network Learning by Covariance Normalization

The problem of multi-domain learning of deep networks is considered. An adaptive layer is induced per target domain and a novel procedure, denoted covariance normalization (CovNorm), proposed to reduce its parameters. CovNorm is a data…

Computer Vision and Pattern Recognition · Computer Science 2019-06-26 Yunsheng Li, Nuno Vasconcelos

GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization

The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the…

Machine Learning · Computer Science 2026-01-30 Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

SeeDNorm: Self-Rescaled Dynamic Normalization

Normalization layer constitutes an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling…

Machine Learning · Computer Science 2026-02-12 Wenrui Cai, Defa Zhu, Qingjie Liu, Qiyang Min

Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Normalization techniques have only recently begun to be exploited in supervised learning tasks. Batch normalization exploits mini-batch statistics to normalize the activations. This was shown to speed up training and result in better…

Machine Learning · Computer Science 2017-03-08 Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H. Sinz, Richard S. Zemel

AdaDM: Enabling Normalization for Image Super-Resolution

Normalization like Batch Normalization (BN) is a milestone technique to normalize the distributions of intermediate layers in deep learning, enabling faster training and better generalization accuracy. However, in fidelity image…

Image and Video Processing · Electrical Eng. & Systems 2021-11-30 Jie Liu, Jie Tang, Gangshan Wu

GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training

Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs).…

Machine Learning · Computer Science 2021-06-14 Tianle Cai, Shengjie Luo, Keyulu Xu, Di He, Tie-Yan Liu, Liwei Wang

Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction

Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held…

Machine Learning · Computer Science 2023-01-18 Kaifeng Lyu, Zhiyuan Li, Sanjeev Arora