Related papers: Data-Aware Random Feature Kernel for Transformers

Spectraformer: A Unified Random Feature Framework for Transformer

Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We…

Machine Learning · Computer Science 2025-09-24 Duke Nguyen, Du Yin, Aditya Joshi, Flora Salim

Macformer: Transformer with Random Maclaurin Feature Attention

Random feature attention (RFA) adopts random Fourier feature (RFF) methods to approximate the softmax function, resulting in a linear time and space attention mechanism that enables the construction of an efficient Transformer. Inspired by…

Machine Learning · Computer Science 2024-08-22 Yuhan Guo, Lizhong Ding, Ye Yuan, Guoren Wang

Learning to Focus: Focal Attention for Selective and Scalable Transformers

Attention is a core component of transformer architecture, whether encoder-only, decoder-only, or encoder-decoder model. However, the standard softmax attention often produces noisy probability distribution, which can impair effective…

Computation and Language · Computer Science 2025-11-11 Dhananjay Ram, Wei Xia, Stefano Soatto

cosFormer: Rethinking Softmax in Attention

Transformer has shown great successes in natural language processing, computer vision, and audio processing. As one of its core components, the softmax attention helps to capture long-range dependencies yet prohibits its scale-up due to the…

Computation and Language · Computer Science 2022-02-18 Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, Yiran Zhong

Random Feature Attention

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not…

Computation and Language · Computer Science 2021-03-23 Hao Peng, Nikolaos Pappas, Dani Yogatama, Roy Schwartz, Noah A. Smith, Lingpeng Kong

Rethinking Attention with Performers

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on…

Machine Learning · Computer Science 2022-11-22 Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

LaplacianFormer: Rethinking Linear Attention with Laplacian Kernel

The quadratic complexity of softmax attention presents a major obstacle for scaling Transformers to high-resolution vision tasks. Existing linear attention variants often replace the softmax with Gaussian kernels to reduce complexity, but…

Computer Vision and Pattern Recognition · Computer Science 2026-04-23 Zhe Feng, Sen Lian, Changwei Wang, Muyang Zhang, Tianlong Tan, Rongtao Xu, Weiliang Meng, Xiaopeng Zhang

PolaFormer: Polarity-aware Linear Attention for Vision Transformers

Linear attention has emerged as a promising alternative to softmax-based attention, leveraging kernelized feature maps to reduce complexity from quadratic to linear in sequence length. However, the non-negative constraint on feature maps…

Computer Vision and Pattern Recognition · Computer Science 2025-03-05 Weikang Meng, Yadan Luo, Xin Li, Dongmei Jiang, Zheng Zhang

LOTFormer: Doubly-Stochastic Linear Attention via Low-Rank Optimal Transport

Transformers have proven highly effective across modalities, but standard softmax attention scales quadratically with sequence length, limiting long context modeling. Linear attention mitigates this by approximating attention with kernel…

Machine Learning · Computer Science 2026-02-10 Ashkan Shahbazi, Chayne Thrash, Yikun Bai, Keaton Hamm, Navid NaderiAlizadeh, Soheil Kolouri

FreqFormer: Hierarchical Frequency-Domain Attention with Adaptive Spectral Routing for Long-Sequence Video Diffusion Transformers

Long-sequence video diffusion transformers hit a quadratic self-attention cost that dominates runtime and memory for very long token sequences. Most efficient attention methods use one approximation everywhere, yet video features are…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Haopeng Jin

Transformer with Fourier Integral Attentions

Multi-head attention empowers the recent success of transformers, the state-of-the-art models that have achieved remarkable success in sequence modeling and beyond. These attention mechanisms compute the pairwise dot products between the…

Machine Learning · Computer Science 2022-06-02 Tan Nguyen, Minh Pham, Tam Nguyen, Khai Nguyen, Stanley J. Osher, Nhat Ho

KDEformer: Accelerating Transformers via Kernel Density Estimation

Dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling, however, naïve exact computation of this model incurs quadratic time and memory complexities in sequence length,…

Machine Learning · Computer Science 2023-06-30 Amir Zandieh, Insu Han, Majid Daliri, Amin Karbasi

EcoFormer: Energy-Saving Attention with Linear Complexity

Transformer is a transformative framework that models sequential data and has achieved remarkable performance on a wide range of tasks, but with high computational and energy cost. To improve its efficiency, a popular choice is to compress…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Jing Liu, Zizheng Pan, Haoyu He, Jianfei Cai, Bohan Zhuang

DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation

Transformers have recently gained attention in the computer vision domain due to their ability to model long-range dependencies. However, the self-attention mechanism, which is the core part of the Transformer model, usually suffers from…

Computer Vision and Pattern Recognition · Computer Science 2023-07-28 Reza Azad, René Arimond, Ehsan Khodapanah Aghdam, Amirhossein Kazerouni, Dorit Merhof

DRAformer: Differentially Reconstructed Attention Transformer for Time-Series Forecasting

Time-series forecasting plays an important role in many real-world scenarios, such as equipment life cycle forecasting, weather forecasting, and traffic flow forecasting. It can be observed from recent research that a variety of…

Machine Learning · Computer Science 2022-06-14 Benhan Li, Shengdong Du, Tianrui Li, Jie Hu, Zhen Jia

Linear Self-Attention Approximation via Trainable Feedforward Kernel

In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches -- models attaining sub-quadratic attention complexity can utilize a notion of sparsity or a low-rank approximation of inputs to reduce…

Machine Learning · Computer Science 2022-11-09 Uladzislau Yorsh, Alexander Kovalenko

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However,…

Computation and Language · Computer Science 2023-10-20 Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, Tuo Zhao

BiFormer: Vision Transformer with Bi-Level Routing Attention

As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Lei Zhu, Xinjiang Wang, Zhanghan Ke, Wayne Zhang, Rynson Lau

Variance-Reducing Couplings for Random Features

Random features (RFs) are a popular technique to scale up kernel methods in machine learning, replacing exact kernel evaluations with stochastic Monte Carlo estimates. They underpin models as diverse as efficient transformers (by…

Machine Learning · Statistics 2024-10-04 Isaac Reid, Stratis Markou, Krzysztof Choromanski, Richard E. Turner, Adrian Weller