Related papers: Linear Self-Attention Approximation via Trainable …

Empowering parameter-efficient transfer learning by recognizing the kernel structure in self-attention

The massive amount of trainable parameters in the pre-trained language models (PLMs) makes them hard to be deployed to multiple downstream tasks. To address this issue, parameter-efficient transfer learning methods have been proposed to…

Computation and Language · Computer Science 2022-10-27 Yifan Chen, Devamanyu Hazarika, Mahdi Namazifar, Yang Liu, Di Jin, Dilek Hakkani-Tur

Spectraformer: A Unified Random Feature Framework for Transformer

Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We…

Machine Learning · Computer Science 2025-09-24 Duke Nguyen, Du Yin, Aditya Joshi, Flora Salim

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

Efficient Attention Mechanisms for Large Language Models: A Survey

Transformer-based architectures have become the prevailing backbone of large language models. However, the quadratic time and memory complexity of self-attention remains a fundamental obstacle to efficient long-context modeling. To address…

Computation and Language · Computer Science 2026-02-10 Yutao Sun, Zhenyu Li, Yike Zhang, Tengyu Pan, Bowen Dong, Yuyi Guo, Jianyong Wang

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To…

Computation and Language · Computer Science 2025-10-27 Mutian He, Philip N. Garner

$k$NN Attention Demystified: A Theoretical Exploration for Scalable Transformers

Despite their power, Transformers face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like $k$-Nearest-Neighbor ($k$NN) attention have been introduced [Roy, Saffar,…

Machine Learning · Computer Science 2024-11-11 Themistoklis Haris

Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

Numerical Analysis · Mathematics 2026-04-03 Michel Fabrice Serret, Alice Cortinovis, Yijun Dong, Diana Halikias, Anna Ma, Fabio Matti, Deanna Needell, Katherine J. Pearce, Elizaveta Rebrova, Disha Shur, Rudi Smith, Hai-Xiao Wang, Laura Grigori

Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Designing Robust Transformers using Robust Kernel Density Estimation

Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering…

Machine Learning · Computer Science 2023-11-09 Xing Han, Tongzheng Ren, Tan Minh Nguyen, Khai Nguyen, Joydeep Ghosh, Nhat Ho

Position-Aware Sequential Attention for Accurate Next Item Recommendations

Sequential self-attention models usually rely on additive positional embeddings, which inject positional information into item representations at the input. In the absence of positional signals, the attention block is…

Information Retrieval · Computer Science 2026-02-25 Timur Nabiev, Evgeny Frolov

Efficient Linear Attention for Fast and Accurate Keypoint Matching

Recently Transformers have provided state-of-the-art performance in sparse matching, crucial to realize high-performance 3D vision applications. Yet, these Transformers lack efficiency due to the quadratic computational complexity of their…

Computer Vision and Pattern Recognition · Computer Science 2022-04-25 Suwichaya Suwanwimolkul, Satoshi Komorita

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Linear Log-Normal Attention with Unbiased Concentration

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This…

Machine Learning · Computer Science 2024-02-27 Yury Nahshan, Joseph Kampeas, Emir Haleva

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

Transformer has achieved great success in NLP. However, the quadratic complexity of the self-attention mechanism in Transformer makes it inefficient in handling long sequences. Many existing works explore to accelerate Transformers by…

Computation and Language · Computer Science 2021-09-03 Chuhan Wu, Fangzhao Wu, Tao Qi, Binxing Jiao, Daxin Jiang, Yongfeng Huang, Xing Xie

FLatten Transformer: Vision Transformer using Focused Linear Attention

The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear…

Computer Vision and Pattern Recognition · Computer Science 2023-09-04 Dongchen Han, Xuran Pan, Yizeng Han, Shiji Song, Gao Huang

Transformer Based Linear Attention with Optimized GPU Kernel Implementation

The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$.…

Machine Learning · Computer Science 2025-10-28 Armin Gerami, Ramani Duraiswami

On The Computational Complexity of Self-Attention

Transformer architectures have led to remarkable progress in many state-of-art applications. However, despite their successes, modern transformers rely on the self-attention mechanism, whose time- and space-complexity is quadratic in the…

Machine Learning · Computer Science 2022-09-13 Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, Chinmay Hegde

Ripple Attention for Visual Perception with Sub-quadratic Complexity

Transformer architectures are now central to sequence modeling tasks. At its heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied…

Computer Vision and Pattern Recognition · Computer Science 2022-06-16 Lin Zheng, Huijie Pan, Lingpeng Kong

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Treeformer: Dense Gradient Trees for Efficient Attention Computation

Standard inference and training with transformer based architectures scale quadratically with input sequence length. This is prohibitively large for a variety of applications especially in web-page translation, query-answering etc.…

Computation and Language · Computer Science 2023-03-20 Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain