Related papers: Random Feature Attention

200 papers

Random-feature-based attention (RFA) is an efficient approximation of softmax attention with linear runtime and space complexity. However, the approximation gap between RFA and conventional softmax attention is not well studied. Built upon…

Machine Learning · Computer Science 2023-02-10 Lin Zheng, Jianbo Yuan, Chong Wang, Lingpeng Kong

Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by…

Machine Learning · Computer Science 2024-08-22 Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages…

Machine Learning · Computer Science 2024-05-08 Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar

Scaling Transformers to ultra-long contexts is bottlenecked by the $O(n^2 d)$ cost of self-attention. Existing methods reduce this cost along the sequence axis through local windows, kernel approximations, or token-level sparsity, but these…

Machine Learning · Computer Science 2026-03-31 Yan Xie, Tiansheng Wen, Tangda Huang, Bo Chen, Chenyu You, Stefanie Jegelka, Yifei Wang

The attention mechanism is a significant part of Transformer models. It helps extract features from embedded vectors by adding global information, and its expressivity has been proven to be powerful. Nevertheless, the quadratic complexity…

Machine Learning · Computer Science 2025-11-11 Hanwen Liu, Yixuan Ma, Shi Jin, Yuguang Wang

Recently, random feature attentions (RFAs) were proposed to approximate softmax attention in linear time and space complexity by linearizing the exponential kernel. In this paper, we first propose a novel perspective to understand the…

Machine Learning · Computer Science 2022-06-16 Lin Zheng, Chong Wang, Lingpeng Kong

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention…

Machine Learning · Computer Science 2024-08-28 Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$.…

Machine Learning · Computer Science 2025-10-28 Armin Gerami, Ramani Duraiswami

Attention is a core component of the transformer architecture, whether in an encoder-only, decoder-only, or encoder-decoder model. However, standard softmax attention often produces a noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token…

Computation and Language · Computer Science 2026-03-16 Yichuan Deng, Zhao Song, Kaijun Yuan, Tianyi Zhou

Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism of linear $O(n)$ time…

Machine Learning · Computer Science 2025-07-14 Vincenzo Dentamaro

Transformer architectures have achieved remarkable success in various domains. While efficient alternatives to Softmax Attention have been widely studied, the search for more expressive mechanisms grounded in theoretical insight, even at…

Machine Learning · Computer Science 2025-10-03 Yifei Zuo, Yutong Yin, Zhichen Zeng, Ang Li, Banghua Zhu, Zhaoran Wang

An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the…

Machine Learning · Computer Science 2025-04-02 Zhixuan Lin, Evgenii Nikishin, Xu Owen He, Aaron Courville

Transformers have shown dominant performance across a range of domains including language and vision. However, their computational cost grows quadratically with the sequence length, making their usage prohibitive for resource-constrained…

Computation and Language · Computer Science 2023-10-24 Yinghan Long, Sayeed Shafayet Chowdhury, Kaushik Roy

Linear attention Transformers and their gated variants, celebrated for enabling parallel training and efficient recurrent inference, still fall short in recall-intensive tasks compared to traditional Transformers and demand significant…

Computation and Language · Computer Science 2024-11-01 Yu Zhang, Songlin Yang, Ruijie Zhu, Yue Zhang, Leyang Cui, Yiqiao Wang, Bolun Wang, Freda Shi, Bailin Wang, Wei Bi, Peng Zhou, Guohong Fu

In the realm of deep learning, spatial attention mechanisms have emerged as a vital method for enhancing the performance of convolutional neural networks. However, these mechanisms possess inherent limitations that cannot be overlooked.…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Xin Zhang, Chen Liu, Degang Yang, Tingting Song, Yichen Ye, Ke Li, Yingze Song

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Transformer-based models have been widely adopted for sentiment analysis tasks due to their exceptional ability to capture contextual information. However, these methods often exhibit suboptimal accuracy in certain scenarios. By analyzing…

Artificial Intelligence · Computer Science 2025-12-25 Yawei Liu

While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a…

Computation and Language · Computer Science 2025-09-19 Yanming Kang, Giang Tran, Hans De Sterck

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention…

Machine Learning · Computer Science 2022-06-28 Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le