Related papers: Holonorm

Transformers without Normalization

Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple…

Machine Learning · Computer Science 2025-06-17 Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu

Bounded Hyperbolic Tangent: A Stable and Efficient Alternative to Pre-Layer Normalization in Large Language Models

Pre-Layer Normalization (Pre-LN) is the de facto choice for large language models (LLMs) and is crucial for stable pretraining and effective transfer learning. However, Pre-LN is inefficient due to repeated statistical calculations and…

Computation and Language · Computer Science 2026-02-04 Hoyoon Byun, Youngjun Choi, Taero Kim, Sungrae Park, Kyungwoo Song

On the Mathematical Relationship Between Layer Normalization and Dynamic Activation Functions

Layer normalization (LN) is an essential component of modern neural networks. While many alternative techniques have been proposed, none of them have succeeded in replacing LN so far. The latest suggestion in this line of research is a…

Machine Learning · Computer Science 2026-04-15 Felix Stollenwerk

Analyzing the Training Dynamics of Image Restoration Transformers: A Revisit to Layer Normalization

This work analyzes the training dynamics of Image Restoration (IR) Transformers and uncovers a critical yet overlooked issue: conventional LayerNorm (LN) drives feature magnitudes to diverge to a million scale and collapses channel-wise…

Computer Vision and Pattern Recognition · Computer Science 2026-02-23 MinKyu Lee, Sangeek Hyun, Woojin Jun, Hyunjun Kim, Jiwoo Chung, Jae-Pil Heo

HAAN: A Holistic Approach for Accelerating Normalization Operations in Large Language Models

Large language models (LLMs) have revolutionized natural language processing (NLP) tasks by achieving state-of-the-art performance across a range of benchmarks. Central to the success of these models is the integration of sophisticated…

Hardware Architecture · Computer Science 2025-02-18 Tianfan Peng, Jiajun Qin, Tianhua Xia, Sai Qian Zhang

Haar Nuclear Norms with Applications to Remote Sensing Imagery Restoration

Remote sensing image restoration aims to reconstruct missing or corrupted areas within images. To date, low-rank based models have garnered significant interest in this field. This paper proposes a novel low-rank regularization term, named…

Image and Video Processing · Electrical Eng. & Systems 2024-12-17 Shuang Xu, Chang Yu, Jiangjun Peng, Xiangyong Cao, Deyu Meng

DeepNet: Scaling Transformers to 1,000 Layers

In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in Transformer, accompanying with…

Computation and Language · Computer Science 2022-03-02 Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Dongdong Zhang, Furu Wei

The Hyperdimensional Transform: a Holographic Representation of Functions

Integral transforms are invaluable mathematical tools to map functions into spaces where they are easier to characterize. We introduce the hyperdimensional transform as a new kind of integral transform. It converts square-integrable…

Machine Learning · Computer Science 2023-10-26 Pieter Dewulf, Michiel Stock, Bernard De Baets

Stronger Normalization-Free Transformers

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT…

Machine Learning · Computer Science 2026-04-01 Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

Approximated Orthonormal Normalisation in Training Neural Networks

Generalisation of a deep neural network (DNN) is one major concern when employing the deep learning approach for solving practical problems. In this paper we propose a new technique, named approximated orthonormal normalisation (AON), to…

Machine Learning · Computer Science 2020-01-15 Guoqiang Zhang, Kenta Niwa, W. B. Kleijn

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks,…

Computation and Language · Computer Science 2025-12-09 Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma

When Does Removing LayerNorm Help? Activation Bounding as a Regime-Dependent Implicit Regularizer

Dynamic Tanh (DyT) removes LayerNorm by bounding activations with a learned tanh(alpha x). We show that this bounding is a regime-dependent implicit regularizer, not a uniformly beneficial replacement. Across GPT-2-family models spanning…

Machine Learning · Computer Science 2026-04-28 Lucky Verma

Multi-modal and frequency-weighted tensor nuclear norm for hyperspectral image denoising

Low-rankness is important in the hyperspectral image (HSI) denoising tasks. The tensor nuclear norm (TNN), defined based on the tensor singular value decomposition, is a state-of-the-art method to describe the low-rankness of HSI. However,…

Image and Video Processing · Electrical Eng. & Systems 2022-06-22 Xiaozhen Xie, Sheng Liu

Stabilizing Inputs to Approximated Nonlinear Functions for Inference with Homomorphic Encryption in Deep Neural Networks

Leveled Homomorphic Encryption (LHE) offers a potential solution that could allow sectors with sensitive data to utilize the cloud and securely deploy their models for remote inference with Deep Neural Networks (DNN). However, this…

Machine Learning · Computer Science 2019-02-07 Moustafa AboulAtta, Matthias Ossadnik, Seyed-Ahmad Ahmadi

TanhSoft -- a family of activation functions combining Tanh and Softplus

Deep learning at its core, contains functions that are composition of a linear transformation with a non-linear function known as activation function. In past few years, there is an increasing interest in construction of novel activation…

Neural and Evolutionary Computing · Computer Science 2020-09-09 Koushik Biswas, Sandeep Kumar, Shilpak Banerjee, Ashish Kumar Pandey

Dynamic Token Normalization Improves Vision Transformers

Vision Transformer (ViT) and its variants (e.g., Swin, PVT) have achieved great success in various computer vision tasks, owing to their capability to learn long-range contextual information. Layer Normalization (LN) is an essential…

Computer Vision and Pattern Recognition · Computer Science 2022-10-17 Wenqi Shao, Yixiao Ge, Zhaoyang Zhang, Xuyuan Xu, Xiaogang Wang, Ying Shan, Ping Luo

Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in…

Computation and Language · Computer Science 2023-12-04 Tam Nguyen, Tan M. Nguyen, Richard G. Baraniuk

L1-Norm Batch Normalization for Efficient Training of Deep Neural Networks

Batch Normalization (BN) has been proven to be quite effective at accelerating and improving the training of deep neural networks (DNNs). However, BN brings additional computation, consumes more memory and generally slows down the training…

Machine Learning · Computer Science 2019-05-23 Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Yuan Xie, Luping Shi

SpanNorm: Reconciling Training Stability and Performance in Deep Transformers

The success of Large Language Models (LLMs) hinges on the stable training of deep Transformer architectures. A critical design choice is the placement of normalization layers, leading to a fundamental trade-off: the "PreNorm" architecture…

Computation and Language · Computer Science 2026-02-02 Chao Wang, Bei Li, Jiaqi Zhang, Xinyu Liu, Yuchun Fan, Linkun Lyu, Xin Chen, Jingang Wang, Tong Xiao, Peng Pei, Xunliang Cai

Higher-Order Regularization Learning on Hypergraphs

Higher-Order Hypergraph Learning (HOHL) was recently introduced as a principled alternative to classical hypergraph regularization, enforcing higher-order smoothness via powers of multiscale Laplacians induced by the hypergraph structure.…

Machine Learning · Computer Science 2025-11-25 Adrien Weihs, Andrea L. Bertozzi, Matthew Thorpe