Related papers: Efficient Attention via Control Variates

Random Feature Attention

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not…

Computation and Language · Computer Science 2021-03-23 Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

Linear Complexity Randomized Self-attention Mechanism

Recently, random feature attentions (RFAs) have been proposed to approximate the softmax attention in linear time and space complexity by linearizing the exponential kernel. In this paper, we first propose a novel perspective to understand the…

Machine Learning · Computer Science 2022-06-16 Lin Zheng, Chong Wang, Lingpeng Kong

How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy

The attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information, and its expressivity has been proved to be powerful. Nevertheless, the quadratic complexity…

Machine Learning · Computer Science 2025-11-11 Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Macformer: Transformer with Random Maclaurin Feature Attention

Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by…

Machine Learning · Computer Science 2024-08-22 Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang

Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation

We introduce Robust Filter Attention (RFA), a formulation of self-attention as a robust state estimator. Each token is treated as a noisy observation of a latent trajectory governed by a linear stochastic differential equation (SDE), and…

Machine Learning · Computer Science 2026-05-26 Peter Racioppo

Breaking the Low-Rank Dilemma of Linear Attention

The Softmax attention mechanism in Transformer models is notoriously computationally expensive, particularly due to its quadratic complexity, posing significant challenges in vision applications. In contrast, linear attention provides a far…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Qihang Fan, Huaibo Huang, Ran He

Degrees of Freedom for Linear Attention: Distilling Softmax Attention with Optimal Feature Efficiency

Linear attention has attracted interest as a computationally efficient approximation to softmax attention, especially for long sequences. Recent studies have explored distilling softmax attention in pre-trained Transformers into linear…

Machine Learning · Computer Science 2025-07-08 Naoki Nishikawa, Rei Higuchi, Taiji Suzuki

GatedFWA: Linear Flash Windowed Attention with Gated Associative Memory

Modern autoregressive models rely on attention, yet the Softmax full attention in Transformers scales quadratically with sequence length. Sliding Window Attention (SWA) achieves linear-time encoding/decoding by constraining the attention…

Machine Learning · Computer Science 2026-01-08 Jiaxu Liu, Yuhe Bai, Xiangyu Yin, Christos-Savvas Bouganis

Selective Particle Attention: Visual Feature-Based Attention in Deep Reinforcement Learning

The human brain uses selective attention to filter perceptual input so that only the components that are useful for behaviour are processed using its limited computational resources. We focus on one particular form of visual attention known…

Neurons and Cognition · Quantitative Biology 2020-08-31 Sam Blakeman, Denis Mareschal

Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at…

Machine Learning · Computer Science 2025-10-03 Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang

FLatten Transformer: Vision Transformer using Focused Linear Attention

The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear…

Computer Vision and Pattern Recognition · Computer Science 2023-09-04 Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang

Scaling Attention via Feature Sparsity

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these…

Machine Learning · Computer Science 2026-03-31 Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang

Sparse and Structured Visual Attention

Visual attention mechanisms are widely used in multimodal tasks, such as visual question answering (VQA). One drawback of softmax-based attention mechanisms is that they assign some probability mass to all image regions, regardless of their…

Computation and Language · Computer Science 2021-07-09 Pedro Henrique Martins, Vlad Niculae, Zita Marinho, André Martins

Softmax Linear Attention: Reclaiming Global Competition

While linear attention reduces the quadratic complexity of standard Transformers to linear time, it often lags behind in expressivity due to the removal of softmax normalization. This omission eliminates global competition, a…

Machine Learning · Computer Science 2026-02-03 Mingwei Xu, Xuan Lin, Xinnan Guo, Wanqing Xu, Wanyun Cui

Variance-Reducing Couplings for Random Features

Random features (RFs) are a popular technique to scale up kernel methods in machine learning, replacing exact kernel evaluations with stochastic Monte Carlo estimates. They underpin models as diverse as efficient transformers (by…

Machine Learning · Statistics 2024-10-04 Isaac Reid, Stratis Markou, Krzysztof Choromanski, Richard E. Turner, Adrian Weller

Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

Numerical Analysis · Mathematics 2026-04-03 Michel Fabrice Serret, Alice Cortinovis, Yijun Dong, Diana Halikias, Anna Ma, Fabio Matti, Deanna Needell, Katherine J. Pearce, Elizaveta Rebrova, Disha Shur, Rudi Smith, Hai-Xiao Wang, Laura Grigori

A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations

The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless, it suffers from two major computational limitations. First, its computations for an attention…

Machine Learning · Computer Science 2016-09-20 Alexandre de Brébisson, Pascal Vincent

Rectifying Magnitude Neglect in Linear Attention

As the core operator of Transformers, Softmax Attention exhibits excellent global modeling capabilities. However, its quadratic complexity limits its applicability to vision tasks. In contrast, Linear Attention shares a similar formulation…

Computer Vision and Pattern Recognition · Computer Science 2025-08-05 Qihang Fan, Huaibo Huang, Yuang Ai, Ran He

Learning to Focus: Focal Attention for Selective and Scalable Transformers

Attention is a core component of the transformer architecture, whether in encoder-only, decoder-only, or encoder-decoder models. However, the standard softmax attention often produces a noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

Why Softmax Attention Outperforms Linear Attention

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou