Related papers: Value-aware Approximate Attention

Memory-efficient Transformers via Top-$k$ Attention

Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient,…

Computation and Language · Computer Science · 2021-06-15 · Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science · 2022-08-02 · Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Continual Low-Rank Scaled Dot-product Attention

Transformers are widely used for their ability to capture data relations in sequence processing, with great success for a wide range of static tasks. However, the computational and memory footprint of their main component, i.e., the Scaled…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-29 · Ginés Carreto Picón, Illia Oleksiienko, Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis

LASER: Attention with Exponential Transformation

Transformers have had tremendous impact for several sequence related tasks, largely due to their ability to retrieve from any part of the sequence via softmax based dot-product attention. This mechanism plays a crucial role in Transformer's…

Machine Learning · Computer Science · 2025-07-15 · Sai Surya Duvvuri, Inderjit S. Dhillon

Object-aware Video-language Pre-training for Retrieval

Recently, by introducing large-scale dataset and strong transformer network, video-language pre-training has shown great success especially for retrieval. Yet, existing video-language transformer models do not explicitly fine-grained…

Computer Vision and Pattern Recognition · Computer Science · 2022-05-19 · Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science · 2021-02-26 · Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Spectraformer: A Unified Random Feature Framework for Transformer

Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We…

Machine Learning · Computer Science · 2025-09-24 · Duke Nguyen, Du Yin, Aditya Joshi, Flora Salim

EcoTransformer: Attention without Multiplication

The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer…

Machine Learning · Computer Science · 2025-08-07 · Xin Gao, Xingming Xu, Shirin Amiraslani, Hong Xu

Data-Aware Random Feature Kernel for Transformers

Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel…

Machine Learning · Computer Science · 2026-03-05 · Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau

Linear Self-Attention Approximation via Trainable Feedforward Kernel

In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches -- models attaining sub-quadratic attention complexity can utilize a notion of sparsity or a low-rank approximation of inputs to reduce…

Machine Learning · Computer Science · 2022-11-09 · Uladzislau Yorsh, Alexander Kovalenko

The Effect of Attention Head Count on Transformer Approximation

Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of…

Machine Learning · Computer Science · 2026-04-01 · Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

Dissecting Query-Key Interaction in Vision Transformers

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-15 · Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language…

Computation and Language · Computer Science · 2024-06-21 · Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

An Attention Matrix for Every Decision: Faithfulness-based Arbitration Among Multiple Attention-Based Interpretations of Transformers in Text Classification

Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations…

Computation and Language · Computer Science · 2022-11-29 · Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

Transformers as Support Vector Machines

Since its inception in "Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through…

Machine Learning · Computer Science · 2024-02-23 · Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the…

Machine Learning · Computer Science · 2024-10-31 · Mingze Wang, Weinan E

Between Rate-Distortion Theory & Value Equivalence in Model-Based Reinforcement Learning

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with…

Machine Learning · Computer Science · 2022-06-07 · Dilip Arumugam, Benjamin Van Roy

FAVOR#: Sharp Attention Kernel Approximations via New Classes of Positive Random Features

The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in…

Machine Learning · Computer Science · 2023-02-03 · Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller

Fair Comparison between Efficient Attentions

Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of…

Computer Vision and Pattern Recognition · Computer Science · 2022-06-02 · Jiuk Hong, Chaehyeon Lee, Soyoun Bang, Heechul Jung

Ripple Attention for Visual Perception with Sub-quadratic Complexity

Transformer architectures are now central to sequence modeling tasks. At its heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied…

Computer Vision and Pattern Recognition · Computer Science · 2022-06-16 · Lin Zheng, Huijie Pan, Lingpeng Kong