Related papers: Dual PatchNorm

Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding

Vision transformers (ViTs) have recently demonstrated state-of-the-art performance in a variety of vision tasks, replacing convolutional neural networks (CNNs). Meanwhile, since ViT has a different architecture than CNN, it may behave…

Computer Vision and Pattern Recognition · Computer Science · 2021-11-17 · Bum Jun Kim, Hyeyeon Choi, Hyeonah Jang, Dong Gu Lee, Wonseok Jeong, Sang Woo Kim

Understanding and Improving Layer Normalization

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness…

Machine Learning · Computer Science · 2019-11-19 · Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin

DeepNet: Scaling Transformers to 1,000 Layers

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanying with…

Computation and Language · Computer Science · 2022-03-02 · Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei

GeoNorm: Unify Pre-Norm and Post-Norm with Geodesic Optimization

The placement of normalization layers, specifically Pre-Norm and Post-Norm, remains an open question in Transformer architecture design. In this work, we rethink these approaches through the lens of manifold optimization, interpreting the…

Machine Learning · Computer Science · 2026-01-30 · Chuanyang Zheng, Jiankai Sun, Yihang Gao, Chi Wang, Yuehao Wang, Jing Xiong, Liliang Ren, Bo Peng, Qingmei Wang, Xiaoran Shang, Mac Schwager, Anderson Schneider, Yuriy Nevmyvaka, Xiaodong Liu

Impact of Layer Norm on Memorization and Generalization in Transformers

Layer Normalization (LayerNorm) is one of the fundamental components in transformers that stabilizes training and improves optimization. In recent times, Pre-LayerNorm transformers have become the preferred choice over Post-LayerNorm…

Machine Learning · Computer Science · 2025-11-14 · Rishi Singhal, Jung-Eun Kim

Learning to Merge Tokens in Vision Transformers

Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In…

Computer Vision and Pattern Recognition · Computer Science · 2022-02-25 · Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme

Masked Transformer for image Anomaly Localization

Image anomaly detection consists in detecting images or image portions that are visually different from the majority of the samples in a dataset. The task is of practical importance for various real-life applications like biomedical image…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-28 · Axel De Nardin, Pankaj Mishra, Gian Luca Foresti, Claudio Piciarelli

Three things everyone should know about Vision Transformers

After their initial success in natural language processing, transformer architectures have rapidly gained traction in computer vision, providing state-of-the-art results for tasks such as image classification, detection, segmentation, and…

Computer Vision and Pattern Recognition · Computer Science · 2022-03-21 · Hugo Touvron, Matthieu Cord, Alaaeldin El-Nouby, Jakob Verbeek, Hervé Jégou

2-D SSM: A General Spatial Layer for Visual Transformers

A central objective in computer vision is to design models with appropriate 2-D inductive bias. Desiderata for 2D inductive bias include two-dimensional position awareness, dynamic spatial locality, and translation and permutation…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-13 · Ethan Baron, Itamar Zimerman, Lior Wolf

Vision Transformers with Patch Diversification

Vision transformer has demonstrated promising performance on challenging computer vision tasks. However, directly training the vision transformers may yield unstable and sub-optimal results. Recent works propose to improve the performance…

Computer Vision and Pattern Recognition · Computer Science · 2021-06-14 · Chengyue Gong, Dilin Wang, Meng Li, Vikas Chandra, Qiang Liu

FlashNorm: Fast Normalization for Transformers

Normalization layers are ubiquitous in large language models (LLMs) yet represent a compute bottleneck: on hardware with distinct vector and matrix execution units, the RMS calculation blocks the subsequent matrix multiplication, preventing…

Machine Learning · Computer Science · 2026-04-28 · Nils Graef, Filip Makraduli, Andrew Wasielewski, Matthew Clapp

Surface Normal Estimation with Transformers

We propose the use of a Transformer to accurately predict normals from point clouds with noise and density variations. Previous learning-based methods utilize PointNet variants to explicitly extract multi-scale features at different input…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-12 · Barry Shichen Hu, Siyun Liang, Johannes Paetzold, Huy H. Nguyen, Isao Echizen, Jiapeng Tang

DDT: Dual-branch Deformable Transformer for Image Denoising

Transformer is beneficial for image denoising tasks since it can model long-range dependencies to overcome the limitations presented by inductive convolutional biases. However, directly applying the transformer structure to remove noise is…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-14 · Kangliang Liu, Xiangcheng Du, Sijie Liu, Yingbin Zheng, Xingjiao Wu, Cheng Jin

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks,…

Computation and Language · Computer Science · 2025-12-09 · Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma

On Batch Orthogonalization Layers

Batch normalization has become ubiquitous in many state-of-the-art nets. It accelerates training and yields good performance results. However, there are various other alternatives to normalization, e.g. orthonormalization. The objective of…

Machine Learning · Computer Science · 2018-12-10 · Blanchette, Laganière

On the Expressivity Role of LayerNorm in Transformers' Attention

Layer Normalization (LayerNorm) is an inherent component in all Transformer-based models. In this paper, we show that LayerNorm is crucial to the expressivity of the multi-head attention layer that follows it. This is in contrast to the…

Machine Learning · Computer Science · 2023-05-12 · Shaked Brody, Uri Alon, Eran Yahav

Beyond BatchNorm: Towards a Unified Understanding of Normalization in Deep Learning

Inspired by BatchNorm, there has been an explosion of normalization layers in deep learning. Recent works have identified a multitude of beneficial properties in BatchNorm to explain its success. However, given the pursuit of alternative…

Machine Learning · Computer Science · 2021-10-27 · Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka

LayerShuffle: Enhancing Robustness in Vision Transformers by Randomizing Layer Execution Order

Due to their architecture and how they are trained, artificial neural networks are typically not robust toward pruning or shuffling layers at test time. However, such properties would be desirable for different applications, such as…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-09 · Matthias Freiberger, Peter Kun, Anders Sundnes Løvlie, Sebastian Risi

Exploiting Layer Normalization Fine-tuning in Visual Transformer Foundation Models for Classification

LayerNorm is pivotal in Vision Transformers (ViTs), yet its fine-tuning dynamics under data scarcity and domain shifts remain underexplored. This paper shows that shifts in LayerNorm parameters after fine-tuning (LayerNorm shifts) are…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Zhaorui Tan, Tan Pan, Kaizhu Huang, Weimiao Yu, Kai Yao, Chen Jiang, Qiufeng Wang, Anh Nguyen, Xin Guo, Yuan Cheng, Xi Yang

PairNorm: Tackling Oversmoothing in GNNs

The performance of graph neural nets (GNNs) is known to gradually decrease with increasing number of layers. This decay is partly attributed to oversmoothing, where repeated graph convolutions eventually make node embeddings…

Machine Learning · Computer Science · 2020-02-14 · Lingxiao Zhao, Leman Akoglu