English
Related papers

Related papers: Self-Adjust Softmax

200 papers

The maximum element of the vector output by the Softmax function approaches zero as the input vector size increases. Transformer-based language models rely on Softmax to compute attention scores, causing the attention distribution to…

Computation and Language · Computer Science 2025-02-03 Ken M. Nakanishi

Softmax is widely used in neural networks for multiclass classification, gate structure and attention mechanisms. The statistical assumption that the input is normal distributed supports the gradient stability of Softmax. However, when used…

Computer Vision and Pattern Recognition · Computer Science 2021-08-17 Shulun Wang , Bin Liu , Feng Liu

Attention is a core component of transformer architecture, whether encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram , Wei Xia , Stefano Soatto

While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as…

Computation and Language · Computer Science 2026-05-12 Omar Naim , Swarnadeep Bhar , Jérôme Bolte , Nicholas Asher

Transformer attention is typically implemented using softmax normalization, which enforces attention weights with unit sum normalization. While effective in many settings, this constraint can limit flexibility in controlling attention…

Computation and Language · Computer Science 2026-02-27 Jeongin Bae , Baeseong Park , Gunho Park , Minsub Kim , Joonhyung Lee , Junhee Yoo , Sunghyeon Woo , Jiwon Ryu , Se Jung Kwon , Dongsoo Lee

Large language models rely on attention mechanisms with a softmax activation. Yet the dominance of softmax over alternatives (e.g., component-wise or linear) remains poorly understood, and many theoretical works have focused on the…

Machine Learning · Computer Science 2026-02-27 O. Duranthon , P. Marion , C. Boyer , B. Loureiro , L. Zdeborová

This paper questions whether the strong performance of softmax attention in transformers stems from producing a probability distribution over inputs. Instead, we argue that softmax's effectiveness lies in its implicit regularization of the…

Machine Learning · Computer Science 2026-03-16 Hemanth Saratchandran , Jianqiao Zheng , Yiping Ji , Wenbo Zhang , Simon Lucey

Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between…

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng , Zhao Song , Kaijun Yuan , Tianyi Zhou

Softmax self-attention often assigns disproportionate weight to semantically uninformative tokens such as special tokens and punctuation, a phenomenon known as attention noise. While recent methods like Cog Attention and the Differential…

Computation and Language · Computer Science 2025-08-27 Ivan Kobyzev , Abbas Ghaddar , Dingtao Hu , Boxing Chen

Transformers have transformed the field of natural language processing. This performance is largely attributed to the use of stacked self-attention layers, each of which consists of matrix multiplies as well as softmax operations. As a…

Hardware Architecture · Computer Science 2021-03-18 Jacob R. Stevens , Rangharajan Venkatesan , Steve Dai , Brucek Khailany , Anand Raghunathan

Vision transformers (ViTs) have pushed the state-of-the-art for various visual recognition tasks by patch-wise image tokenization followed by self-attention. However, the employment of self-attention modules results in a quadratic…

Computer Vision and Pattern Recognition · Computer Science 2022-05-03 Jiachen Lu , Jinghan Yao , Junge Zhang , Xiatian Zhu , Hang Xu , Weiguo Gao , Chunjing Xu , Tao Xiang , Li Zhang

The softmax function is widely used in artificial neural networks for the multiclass classification problems, where the softmax transformation enforces the output to be positive and sum to one, and the corresponding loss function allows to…

Machine Learning · Computer Science 2021-12-24 Shaoshi Sun , Zhenyuan Zhang , BoCheng Huang , Pengbin Lei , Jianlin Su , Shengfeng Pan , Jiarun Cao

Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear…

Machine Learning · Computer Science 2025-07-08 Naoki Nishikawa , Rei Higuchi , Taiji Suzuki

Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks. The self-attention mechanism underpinning the strength of ViTs has a quadratic complexity in both computation and memory usage. This motivates the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-18 Jiachen Lu , Junge Zhang , Xiatian Zhu , Jianfeng Feng , Tao Xiang , Li Zhang

Transformers have had tremendous impact for several sequence related tasks, largely due to their ability to retrieve from any part of the sequence via softmax based dot-product attention. This mechanism plays a crucial role in Transformer's…

Machine Learning · Computer Science 2025-07-15 Sai Surya Duvvuri , Inderjit S. Dhillon

Attention based models such as Transformers involve pairwise interactions between data points, modeled with a learnable attention matrix. Importantly, this attention matrix is normalized with the SoftMax operator, which makes it row-wise…

Machine Learning · Computer Science 2022-01-25 Michael E. Sander , Pierre Ablin , Mathieu Blondel , Gabriel Peyré

Certified verification of transformer attention requires bounding the softmax function over interval constraints on the pre-softmax scores. Existing verifiers relax softmax ndependently of the downstream objective, leaving avoidable slack.…

Machine Learning · Computer Science 2026-05-13 Navid Rezazadeh , Arash Gholami Davoodi

Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax…

Machine Learning · Computer Science 2025-12-15 Etienne Boursier , Claire Boyer

Transformer has shown great successes in natural language processing, computer vision, and audio processing. As one of its core components, the softmax attention helps to capture long-range dependencies yet prohibits its scale-up due to the…

Computation and Language · Computer Science 2022-02-18 Zhen Qin , Weixuan Sun , Hui Deng , Dongxu Li , Yunshen Wei , Baohong Lv , Junjie Yan , Lingpeng Kong , Yiran Zhong
‹ Prev 1 2 3 10 Next ›