Related papers: Random Feature Attention

Efficient Attention via Control Variates

Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Built upon…

Machine Learning · Computer Science 2023-02-10 Lin Zheng, Jianbo Yuan, Chong Wang, Lingpeng Kong

Macformer: Transformer with Random Maclaurin Feature Attention

Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by…

Machine Learning · Computer Science 2024-08-22 Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang

TransformerFAM: Feedback attention is working memory

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages…

Machine Learning · Computer Science 2024-05-08 Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar

Scaling Attention via Feature Sparsity

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these…

Machine Learning · Computer Science 2026-03-31 Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang

How Particle-System Random Batch Methods Enhance Graph Transformer: Memory Efficiency and Parallel Computing Strategy

Attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information and its expressivity has been proved to be powerful. Nevertheless, the quadratic complexity…

Machine Learning · Computer Science 2025-11-11 Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Linear Complexity Randomized Self-attention Mechanism

Recently, random feature attentions (RFAs) are proposed to approximate the softmax attention in linear time and space complexity by linearizing the exponential kernel. In this paper, we first propose a novel perspective to understand the…

Machine Learning · Computer Science 2022-06-16 Lin Zheng, Chong Wang, Lingpeng Kong

Gated Linear Attention Transformers with Hardware-Efficient Training

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention…

Machine Learning · Computer Science 2024-08-28 Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim

Transformer Based Linear Attention with Optimized GPU Kernel Implementation

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$.…

Machine Learning · Computer Science 2025-10-28 Armin Gerami, Ramani Duraiswami

Learning to Focus: Focal Attention for Selective and Scalable Transformers

Attention is a core component of transformer architecture, whether encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

Why Softmax Attention Outperforms Linear Attention

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou

Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)

Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism of linear $O(n)$ time…

Machine Learning · Computer Science 2025-07-14 Vincenzo Dentamaro

Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regression

Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at…

Machine Learning · Computer Science 2025-10-03 Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang

Forgetting Transformer: Softmax Attention with a Forget Gate

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the…

Machine Learning · Computer Science 2025-04-02 Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville

Segmented Recurrent Transformer: An Efficient Sequence-to-Sequence Model

Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage prohibitive for resource-constrained…

Computation and Language · Computer Science 2023-10-24 Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy

Gated Slot Attention for Efficient Linear-Time Sequence Modeling

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant…

Computation and Language · Computer Science 2024-11-01 Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

RFAConv: Receptive-Field Attention Convolution for Improving Convolutional Neural Networks

In the realm of deep learning, spatial attention mechanisms have emerged as a vital method for enhancing the performance of convolutional neural networks. However, these mechanisms possess inherent limitations that cannot be overlooked.…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Xin Zhang, Chen Liu, Degang Yang, Tingting Song, Yichen Ye, Ke Li, Yingze Song

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

From Fake Focus to Real Precision: Confusion-Driven Adversarial Attention Learning in Transformers

Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. However, these methods often exhibit suboptimal accuracy in certain scenarios. By analyzing…

Artificial Intelligence · Computer Science 2025-12-25 Yawei Liu

Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images

While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a…

Computation and Language · Computer Science 2025-09-19 Yanming Kang, Giang Tran, Hans De Sterck

Transformer Quality in Linear Time

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention…

Machine Learning · Computer Science 2022-06-28 Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le