Related papers: Query-Key Normalization for Transformers

Enhanced QKNorm normalization for neural transformers with the Lp norm

The normalization of query and key vectors is an essential part of the Transformer architecture. It ensures that learning is stable regardless of the scale of these vectors. Some normalization approaches are available. In this preliminary…

Machine Learning · Computer Science 2026-02-06 Ezequiel Lopez-Rubio, Javier Montes-Perez, Esteban Jose Palomo

Transformers without Tears: Improving the Normalization of Self-Attention

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large…

Computation and Language · Computer Science 2020-01-01 Toan Q. Nguyen, Julian Salazar

KL Regularized Normalization Framework for Low Resource Tasks

Large pre-trained models, such as Bert, GPT, and Wav2Vec, have demonstrated great potential for learning representations that are transferable to a wide variety of downstream tasks. It is difficult to obtain a large quantity of supervised…

Computation and Language · Computer Science 2022-12-23 Neeraj Kumar, Ankur Narang, Brejesh Lall

UnitNorm: Rethinking Normalization for Transformers in Time Series

Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks, yet traditional methods like batch and layer normalization often lead to issues such as token shift, attention…

Machine Learning · Computer Science 2024-05-28 Nan Huang, Christian Kümmerle, Xiang Zhang

Low-resource neural machine translation with morphological modeling

Morphological modeling in neural machine translation (NMT) is a promising approach to achieving open-vocabulary machine translation for morphologically-rich languages. However, existing methods such as sub-word tokenization and…

Computation and Language · Computer Science 2024-04-04 Antoine Nzeyimana

Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization

The Transformer model has been pivotal in advancing fields such as natural language processing, speech recognition, and computer vision. However, a critical limitation of this model is its quadratic computational and memory complexity…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Firas Khader, Omar S. M. El Nahhas, Tianyu Han, Gustav Müller-Franzes, Sven Nebelung, Jakob Nikolas Kather, Daniel Truhn

Learning in Compact Spaces with Approximately Normalized Transformer

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization…

Machine Learning · Computer Science 2025-11-20 Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

Training Deeper Neural Machine Translation Models with Transparent Attention

While current state-of-the-art NMT models, such as RNN seq2seq and Transformers, possess a large number of parameters, they are still shallow in comparison to convolutional models used for both text and vision applications. In this work we…

Computation and Language · Computer Science 2018-09-06 Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, Yonghui Wu

Optimizing Transformer for Low-Resource Neural Machine Translation

Language pairs with limited amounts of parallel data, also known as low-resource languages, remain a challenge for neural machine translation. While the Transformer model has achieved significant improvements for many language pairs and has…

Computation and Language · Computer Science 2020-11-05 Ali Araabi, Christof Monz

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, many challenges remain in training deep transformer networks,…

Computation and Language · Computer Science 2025-12-09 Zhijian Zhuo, Yutao Zeng, Ya Wang, Sijun Zhang, Jian Yang, Xiaoqing Li, Xun Zhou, Jinwen Ma

Cross Modal Retrieval with Querybank Normalisation

Profiting from large-scale training datasets, advances in neural architecture design and efficient inference, joint embeddings have become the dominant approach for tackling cross-modal retrieval. In this work we first show that, despite…

Computer Vision and Pattern Recognition · Computer Science 2022-04-20 Simion-Vlad Bogolin, Ioana Croitoru, Hailin Jin, Yang Liu, Samuel Albanie

Key-Value Transformer

Transformers have emerged as the prevailing standard solution for various AI tasks, including computer vision and natural language processing. The widely adopted Query, Key, and Value formulation (QKV) has played a significant role in this.…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Ali Borji

Controlling changes to attention logits

Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without any intervention. Applying normalization to queries and keys, known…

Machine Learning · Computer Science 2025-11-27 Ben Anson, Laurence Aitchison

Efficient Machine Translation with a BiLSTM-Attention Approach

With the rapid development of Natural Language Processing (NLP) technology, the accuracy and efficiency of machine translation have become hot topics of research. This paper proposes a novel Seq2Seq model aimed at improving translation…

Computation and Language · Computer Science 2024-11-01 Yuxu Wu, Yiren Xing

Norm Tweaking: High-performance Low-bit Quantization of Large Language Models

As the size of large language models (LLMs) continues to grow, model compression without sacrificing accuracy has become a crucial challenge for deployment. While some quantization methods, such as GPTQ, have made progress in achieving…

Machine Learning · Computer Science 2023-12-14 Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu

On Retrieval Augmentation and the Limitations of Language Model Training

Augmenting a language model (LM) with $k$-nearest neighbors ($k$NN) retrieval on its training data alone can decrease its perplexity, though the underlying reasons for this remain elusive. In this work, we rule out one previously posited…

Computation and Language · Computer Science 2024-04-03 Ting-Rui Chiang, Xinyan Velocity Yu, Joshua Robinson, Ollie Liu, Isabelle Lee, Dani Yogatama

Enhanced Transformer Architecture for Natural Language Processing

Transformer is a state-of-the-art model in the field of natural language processing (NLP). Current NLP models primarily increase the number of transformers to improve processing performance. However, this technique requires a lot of…

Computation and Language · Computer Science 2023-10-18 Woohyeon Moon, Taeyoung Kim, Bumgeun Park, Dongsoo Har

Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

In Transformer models, non-GEMM (non-General Matrix Multiplication) operations -- especially Softmax and Layer Normalization (LayerNorm) -- often dominate hardware cost due to their nonlinear nature. To address this, previous approximation…

Hardware Architecture · Computer Science 2026-04-28 Dawon Choi, Hana Kim, Ji-Hoon Kim

PolyNorm: Few-Shot LLM-Based Text Normalization for Text-to-Speech

Text Normalization (TN) is a key preprocessing step in Text-to-Speech (TTS) systems, converting written forms into their canonical spoken equivalents. Traditional TN systems can exhibit high accuracy, but involve substantial engineering…

Computation and Language · Computer Science 2025-11-06 Michel Wong, Ali Alshehri, Sophia Kao, Haotian He

PowerNorm: Rethinking Batch Normalization in Transformers

The standard normalization method for neural network (NN) models used in Natural Language Processing (NLP) is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The…

Computation and Language · Computer Science 2021-04-21 Sheng Shen, Zhewei Yao, Amir Gholami, Michael W. Mahoney, Kurt Keutzer