Related papers: Efficient Attention via Control Variates

200 papers

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not…

Computation and Language · Computer Science 2021-03-23 Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

Recently, random feature attentions (RFAs) have been proposed to approximate the softmax attention in linear time and space complexity by linearizing the exponential kernel. In this paper, we first propose a novel perspective to understand the…

Machine Learning · Computer Science 2022-06-16 Lin Zheng, Chong Wang, Lingpeng Kong

The attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information, and its expressivity has been proved to be powerful. Nevertheless, the quadratic complexity…

Machine Learning · Computer Science 2025-11-11 Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by…

Machine Learning · Computer Science 2024-08-22 Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and…

Machine Learning · Computer Science 2026-05-26 Peter Racioppo

The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Qihang Fan, Huaibo Huang, Ran He

Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear…

Machine Learning · Computer Science 2025-07-08 Naoki Nishikawa, Rei Higuchi, Taiji Suzuki

Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention…

Machine Learning · Computer Science 2026-01-08 Jiaxu Liu, Yuhe Bai, Xiangyu Yin, Christos-Savvas Bouganis

The human brain uses selective attention to filter perceptual input so that only the components that are useful for behaviour are processed using its limited computational resources. We focus on one particular form of visual attention known…

Neurons and Cognition · Quantitative Biology 2020-08-31 Sam Blakeman, Denis Mareschal

Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at…

Machine Learning · Computer Science 2025-10-03 Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang

The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear…

Computer Vision and Pattern Recognition · Computer Science 2023-09-04 Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these…

Machine Learning · Computer Science 2026-03-31 Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang

Visual attention mechanisms are widely used in multimodal tasks, such as visual question answering (VQA). One drawback of softmax-based attention mechanisms is that they assign some probability mass to all image regions, regardless of their…

Computation and Language · Computer Science 2021-07-09 Pedro Henrique Martins, Vlad Niculae, Zita Marinho, André Martins

While linear attention reduces the quadratic complexity of standard Transformers to linear time, it often lags behind in expressivity due to the removal of softmax normalization. This omission eliminates global competition, a…

Machine Learning · Computer Science 2026-02-03 Mingwei Xu, Xuan Lin, Xinnan Guo, Wanqing Xu, Wanyun Cui

Random features (RFs) are a popular technique to scale up kernel methods in machine learning, replacing exact kernel evaluations with stochastic Monte Carlo estimates. They underpin models as diverse as efficient transformers (by…

Machine Learning · Statistics 2024-10-04 Isaac Reid, Stratis Markou, Krzysztof Choromanski, Richard E. Turner, Adrian Weller

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention…

Machine Learning · Computer Science 2016-09-20 Alexandre de Brébisson, Pascal Vincent

As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation…

Computer Vision and Pattern Recognition · Computer Science 2025-08-05 Qihang Fan, Huaibo Huang, Yuang Ai, Ran He

Attention is a core component of the transformer architecture, whether in an encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces a noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou