Related papers: Value-aware Approximate Attention

200 papers

Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient,…

Computation and Language · Computer Science 2021-06-15 Ankit Gupta, Guy Dar, Shaya Goodman, David Ciprut, Jonathan Berant

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Transformers are widely used for their ability to capture data relations in sequence processing, with great success for a wide range of static tasks. However, the computational and memory footprint of their main component, i.e., the Scaled…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Ginés Carreto Picón, Illia Oleksiienko, Lukas Hedegaard, Arian Bakhtiarnia, Alexandros Iosifidis

Transformers have had tremendous impact for several sequence related tasks, largely due to their ability to retrieve from any part of the sequence via softmax based dot-product attention. This mechanism plays a crucial role in Transformer's…

Machine Learning · Computer Science 2025-07-15 Sai Surya Duvvuri, Inderjit S. Dhillon

Recently, by introducing large-scale dataset and strong transformer network, video-language pre-training has shown great success especially for retrieval. Yet, existing video-language transformer models do not explicitly fine-grained…

Computer Vision and Pattern Recognition · Computer Science 2022-05-19 Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Linearization of attention using various kernel approximation and kernel learning techniques has shown promise. Past methods used a subset of combinations of component functions and weight matrices within the random feature paradigm. We…

Machine Learning · Computer Science 2025-09-24 Duke Nguyen, Du Yin, Aditya Joshi, Flora Salim

The Transformer, with its scaled dot-product attention mechanism, has become a foundational architecture in modern AI. However, this mechanism is computationally intensive and incurs substantial energy costs. We propose a new Transformer…

Machine Learning · Computer Science 2025-08-07 Xin Gao, Xingming Xu, Shirin Amiraslani, Hong Xu

Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel…

Machine Learning · Computer Science 2026-03-05 Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau

In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches -- models attaining sub-quadratic attention complexity can utilize a notion of sparsity or a low-rank approximation of inputs to reduce…

Machine Learning · Computer Science 2022-11-09 Uladzislau Yorsh, Alexander Kovalenko

Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of…

Machine Learning · Computer Science 2026-04-01 Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to…

Computer Vision and Pattern Recognition · Computer Science 2025-01-15 Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language…

Computation and Language · Computer Science 2024-06-21 Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations…

Computation and Language · Computer Science 2022-11-29 Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

Since its inception in "Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through…

Machine Learning · Computer Science 2024-02-23 Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, Samet Oymak

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the…

Machine Learning · Computer Science 2024-10-31 Mingze Wang, Weinan E

The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with…

Machine Learning · Computer Science 2022-06-07 Dilip Arumugam, Benjamin Van Roy

The problem of efficient approximation of a linear operator induced by the Gaussian or softmax kernel is often addressed using random features (RFs) which yield an unbiased approximation of the operator's result. Such operators emerge in…

Machine Learning · Computer Science 2023-02-03 Valerii Likhosherstov, Krzysztof Choromanski, Avinava Dubey, Frederick Liu, Tamas Sarlos, Adrian Weller

Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of…

Computer Vision and Pattern Recognition · Computer Science 2022-06-02 Jiuk Hong, Chaehyeon Lee, Soyoun Bang, Heechul Jung

Transformer architectures are now central to sequence modeling tasks. At its heart is the attention mechanism, which enables effective modeling of long-term dependencies in a sequence. Recently, transformers have been successfully applied…

Computer Vision and Pattern Recognition · Computer Science 2022-06-16 Lin Zheng, Huijie Pan, Lingpeng Kong