Related papers: Normalized Attention Without Probability Cage

Limitations of Normalization in Attention Mechanism

This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token…

Machine Learning · Computer Science 2025-10-21 Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention…

Computation and Language · Computer Science 2026-02-27 Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

Rethinking Attention: Polynomial Alternatives to Softmax in Transformers

This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the…

Machine Learning · Computer Science 2026-03-16 Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey

Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax…

Machine Learning · Computer Science 2026-04-20 Yuval Ran-Milo

Learning to Focus: Focal Attention for Selective and Scalable Transformers

Attention is a core component of the transformer architecture, whether the model is encoder-only, decoder-only, or encoder-decoder. However, standard softmax attention often produces a noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

Sigmoid Self-Attention has Lower Sample Complexity than Softmax Self-Attention: A Mixture-of-Experts Perspective

At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure…

Machine Learning · Computer Science 2025-05-27 Fanqi Yan, Huy Nguyen, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the…

Machine Learning · Computer Science 2026-02-27 O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

Sinkformers: Transformers with Doubly Stochastic Attention

Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise…

Machine Learning · Computer Science 2022-01-25 Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between…

Machine Learning · Computer Science 2025-01-23 Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb

Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention

Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic…

Machine Learning · Computer Science 2026-05-12 Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang

Long-Context Generalization with Sparse Attention

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for…

Computation and Language · Computer Science 2026-03-03 Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

A Mechanistic Analysis of Transformers for Dynamical Systems

Transformers are increasingly adopted for modeling and forecasting time-series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state-space models,…

Machine Learning · Computer Science 2025-12-25 Gregory Duthé, Nikolaos Evangelou, Wei Liu, Ioannis G. Kevrekidis, Eleni Chatzi

Universal Approximation with Softmax Attention

We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our…

Machine Learning · Computer Science 2025-12-17 Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu

Why Softmax Attention Outperforms Linear Attention

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou

Memorization Capacity of Multi-Head Attention in Transformers

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention…

Machine Learning · Computer Science 2024-03-05 Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis

Softmax Attention with Constant Cost per Token

We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of…

Machine Learning · Computer Science 2024-04-30 Franz A. Heinsen

ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only…

Machine Learning · Computer Science 2026-02-06 Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

The Transformer architecture has become ubiquitous in the natural language processing field. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations

The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention…

Machine Learning · Computer Science 2016-09-20 Alexandre de Brébisson, Pascal Vincent

SOFT: Softmax-free Transformer with Linear Complexity

Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the employment of self-attention modules results in a quadratic…

Computer Vision and Pattern Recognition · Computer Science 2022-05-03 Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang