Related papers: Normalized Attention Without Probability Cage

200 papers

This paper investigates the limitations of the normalization in attention mechanisms. We begin with a theoretical framework that enables the identification of the model's selective ability and the geometric separation involved in token…

Machine Learning · Computer Science 2025-10-21 Timur Mudarisov, Mikhail Burtsev, Tatiana Petrova, Radu State

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention…

Computation and Language · Computer Science 2026-02-27 Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the…

Machine Learning · Computer Science 2026-03-16 Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey

Transformers often display an attention sink: probability mass concentrates on a fixed, content-agnostic position. Are sinks a byproduct of the optimization/training regime? Or are they sometimes functionally necessary in softmax…

Machine Learning · Computer Science 2026-04-20 Yuval Ran-Milo

Attention is a core component of the transformer architecture, whether in an encoder-only, decoder-only, or encoder-decoder model. However, standard softmax attention often produces a noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

At the core of the popular Transformer architecture is the self-attention mechanism, which dynamically assigns softmax weights to each input token so that the model can focus on the most salient information. However, the softmax structure…

Machine Learning · Computer Science 2025-05-27 Fanqi Yan, Huy Nguyen, Pedram Akbarian, Nhat Ho, Alessandro Rinaldo

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the…

Machine Learning · Computer Science 2026-02-27 O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise…

Machine Learning · Computer Science 2022-01-25 Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between…

Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic…

Machine Learning · Computer Science 2026-05-12 Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for…

Computation and Language · Computer Science 2026-03-03 Pavlo Vasylenko, Hugo Pitorro, André F. T. Martins, Marcos Treviso

Transformers are increasingly adopted for modeling and forecasting time-series, yet their internal mechanisms remain poorly understood from a dynamical systems perspective. In contrast to classical autoregressive and state-space models,…

Machine Learning · Computer Science 2025-12-25 Gregory Duthé, Nikolaos Evangelou, Wei Liu, Ioannis G. Kevrekidis, Eleni Chatzi

We prove that with linear transformations, both (i) two-layer self-attention and (ii) one-layer self-attention followed by a softmax function are universal approximators for continuous sequence-to-sequence functions on compact domains. Our…

Machine Learning · Computer Science 2025-12-17 Jerry Yao-Chieh Hu, Hude Liu, Hong-Yu Chen, Weimin Wu, Han Liu

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention…

Machine Learning · Computer Science 2024-03-05 Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis

We propose a simple modification to the conventional attention mechanism applied by Transformers: Instead of quantifying pairwise query-key similarity with scaled dot-products, we quantify it with the logarithms of scaled dot-products of…

Machine Learning · Computer Science 2024-04-30 Franz A. Heinsen

Linear attention methods offer Transformers $O(N)$ complexity but typically underperform standard softmax attention. We identify two fundamental limitations affecting these approaches: the restriction to convex combinations that only…

Machine Learning · Computer Science 2026-02-06 Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention…

Machine Learning · Computer Science 2016-09-20 Alexandre de Brébisson, Pascal Vincent

Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the employment of self-attention modules results in a quadratic…

Computer Vision and Pattern Recognition · Computer Science 2022-05-03 Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang