Related papers: Self-Adjust Softmax

Scalable-Softmax Is Superior for Attention

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to…

Computation and Language · Computer Science 2025-02-03 Ken M. Nakanishi

Escaping the Gradient Vanishing: Periodic Alternatives of Softmax in Attention Mechanism

Softmax is widely used in neural networks for multiclass classification, gate structures, and attention mechanisms. The statistical assumption that the input is normally distributed supports the gradient stability of Softmax. However, when used…

Computer Vision and Pattern Recognition · Computer Science 2021-08-17 Shulun Wang, Bin Liu, Feng Liu

Learning to Focus: Focal Attention for Selective and Scalable Transformers

Attention is a core component of the transformer architecture, whether in encoder-only, decoder-only, or encoder-decoder models. However, standard softmax attention often produces a noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

SSA: Improving Performance With a Better Scoring Function

While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as…

Computation and Language · Computer Science 2026-05-12 Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher

Affine-Scaled Attention: Towards Flexible and Stable Transformer Attention

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention…

Computation and Language · Computer Science 2026-02-27 Jeongin Bae, Baeseong Park, Gunho Park, Minsub Kim, Joonhyung Lee, Junhee Yoo, Sunghyeon Woo, Jiwon Ryu, Se Jung Kwon, Dongsoo Lee

Statistical Advantage of Softmax Attention: Insights from Single-Location Regression

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the…

Machine Learning · Computer Science 2026-02-27 O. Duranthon, P. Marion, C. Boyer, B. Loureiro, L. Zdeborová

Rethinking Attention: Polynomial Alternatives to Softmax in Transformers

This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the…

Machine Learning · Computer Science 2026-03-16 Hemanth Saratchandran, Jianqiao Zheng, Yiping Ji, Wenbo Zhang, Simon Lucey

Theory, Analysis, and Best Practices for Sigmoid Self-Attention

Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between…

Machine Learning · Computer Science 2025-01-23 Jason Ramapuram, Federico Danieli, Eeshan Dhekane, Floris Weers, Dan Busbridge, Pierre Ablin, Tatiana Likhomanenko, Jagrit Digani, Zijin Gu, Amitis Shidani, Russ Webb

Why Softmax Attention Outperforms Linear Attention

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou

Integral Transformer: Denoising Attention, Not Too Much Not Too Little

Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential…

Computation and Language · Computer Science 2025-08-27 Ivan Kobyzev, Abbas Ghaddar, Dingtao Hu, Boxing Chen

Softermax: Hardware/Software Co-Design of an Efficient Softmax for Transformers

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science 2021-03-18 Jacob R. Stevens, Rangharajan Venkatesan, Steve Dai, Brucek Khailany, Anand Raghunathan

SOFT: Softmax-free Transformer with Linear Complexity

Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the employment of self-attention modules results in a quadratic…

Computer Vision and Pattern Recognition · Computer Science 2022-05-03 Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, Li Zhang

Sparse-softmax: A Simpler and Faster Alternative Softmax Transformation

The softmax function is widely used in artificial neural networks for multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun, Zhenyuan Zhang, BoCheng Huang, Pengbin Lei, Jianlin Su, Shengfeng Pan, Jiarun Cao

Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency

Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear…

Machine Learning · Computer Science 2025-07-08 Naoki Nishikawa, Rei Higuchi, Taiji Suzuki

Softmax-free Linear Transformers

Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-18 Jiachen Lu, Junge Zhang, Xiatian Zhu, Jianfeng Feng, Tao Xiang, Li Zhang

LASER: Attention with Exponential Transformation

Transformers have had tremendous impact for several sequence related tasks, largely due to their ability to retrieve from any part of the sequence via softmax based dot-product attention. This mechanism plays a crucial role in Transformer's…

Machine Learning · Computer Science 2025-07-15 Sai Surya Duvvuri, Inderjit S. Dhillon

Sinkformers: Transformers with Doubly Stochastic Attention

Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise…

Machine Learning · Computer Science 2022-01-25 Michael E. Sander, Pierre Ablin, Mathieu Blondel, Gabriel Peyré

Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

Certified verification of transformer attention requires bounding the softmax function over interval constraints on the pre-softmax scores. Existing verifiers relax softmax independently of the downstream objective, leaving avoidable slack.…

Machine Learning · Computer Science 2026-05-13 Navid Rezazadeh, Arash Gholami Davoodi

Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax…

Machine Learning · Computer Science 2025-12-15 Etienne Boursier, Claire Boyer

cosFormer: Rethinking Softmax in Attention

The Transformer has shown great success in natural language processing, computer vision, and audio processing. As one of its core components, softmax attention helps to capture long-range dependencies yet prohibits its scale-up due to the…

Computation and Language · Computer Science 2022-02-18 Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong