Related papers: Attention Enables Zero Approximation Error

Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation

Transformer-based models have brought a radical change to neural machine translation. A key feature of the Transformer architecture is the so-called multi-head attention mechanism, which allows the model to focus simultaneously on different…

Computation and Language · Computer Science 2020-10-06 Alessandro Raganato , Yves Scherrer , Jörg Tiedemann

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by…

Machine Learning · Computer Science 2020-01-01 Thomas Dowdell , Hongyu Zhang

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon , Ching-Feng Yeh , Hakan Inan , Wei-Ning Hsu , Rashi Rungta , Yashar Mehdad , Daniel Bikel

Combiner: Full Attention Transformer with Sparse Computation Cost

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the…

Machine Learning · Computer Science 2021-10-29 Hongyu Ren , Hanjun Dai , Zihang Dai , Mengjiao Yang , Jure Leskovec , Dale Schuurmans , Bo Dai

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their…

Machine Learning · Computer Science 2023-08-02 Yihe Dong , Jean-Baptiste Cordonnier , Andreas Loukas

The Effect of Attention Head Count on Transformer Approximation

Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of…

Machine Learning · Computer Science 2026-04-01 Penghao Yu , Haotian Jiang , Zeyu Bao , Ruoxi Yu , Qianxiao Li

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism.…

Computation and Language · Computer Science 2023-08-03 Ashish Vaswani , Noam Shazeer , Niki Parmar , Jakob Uszkoreit , Llion Jones , Aidan N. Gomez , Lukasz Kaiser , Illia Polosukhin

Attention-based clustering

Transformers have emerged as a powerful neural network architecture capable of tackling a wide range of learning tasks. In this work, we provide a theoretical analysis of their ability to automatically extract structure from data in an…

Machine Learning · Statistics 2025-10-29 Rodrigo Maulen-Soto , Pierre Marion , Claire Boyer

On the Usefulness of Self-Attention for Automatic Speech Recognition with Transformers

Self-attention models such as Transformers, which can capture temporal relationships without being limited by the distance between events, have given competitive speech recognition results. However, we note the range of the learned context…

Computation and Language · Computer Science 2020-11-11 Shucong Zhang , Erfan Loweimi , Peter Bell , Steve Renals

What can a Single Attention Layer Learn? A Study Through the Random Features Lens

Attention layers -- which map a sequence of inputs to a sequence of outputs -- are core building blocks of the Transformer architecture which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a…

Machine Learning · Computer Science 2023-07-24 Hengyu Fu , Tianyu Guo , Yu Bai , Song Mei

Transformers as Support Vector Machines

Since its inception in "Attention Is All You Need", transformer architecture has led to revolutionary advancements in NLP. The attention layer within the transformer admits a sequence of input tokens $X$ and makes them interact through…

Machine Learning · Computer Science 2024-02-23 Davoud Ataee Tarzanagh , Yingcong Li , Christos Thrampoulidis , Samet Oymak

Wide Attention Is The Way Forward For Transformers?

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown , Yiren Zhao , Ilia Shumailov , Robert D Mullins

Is Random Attention Sufficient for Sequence Modeling? Disentangling Trainable Components in the Transformer

The transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of tasks - including mathematical reasoning, memorization, and retrieval - using…

Machine Learning · Computer Science 2025-09-05 Yihe Dong , Lorenzo Noci , Mikhail Khodak , Mufan Li

On the Optimization and Generalization of Multi-head Attention

The training and generalization dynamics of the Transformer's core mechanism, namely the Attention mechanism, remain under-explored. Besides, existing analyses primarily focus on single-head attention. Inspired by the demonstrated benefits…

Machine Learning · Computer Science 2024-10-15 Puneesh Deora , Rouzbeh Ghaderi , Hossein Taheri , Christos Thrampoulidis

Revisiting Transformers with Insights from Image Filtering and Boosting

The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Laziz U. Abdullaev , Maksim Tkachenko , Tan M. Nguyen

Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost

To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under a hand-crafted patterns and 2) full attention followed by a…

Machine Learning · Computer Science 2022-10-28 Sungjun Cho , Seonwoo Min , Jinwoo Kim , Moontae Lee , Honglak Lee , Seunghoon Hong

Set Transformer: A Framework for Attention-based Permutation-Invariant Neural Networks

Many machine learning tasks such as multiple instance learning, 3D shape recognition, and few-shot image classification are defined on sets of instances. Since solutions to such problems do not depend on the order of elements of the set,…

Machine Learning · Computer Science 2019-05-28 Juho Lee , Yoonho Lee , Jungtaek Kim , Adam R. Kosiorek , Seungjin Choi , Yee Whye Teh

Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers

Transformers are increasingly dominating multi-modal reasoning tasks, such as visual question answering, achieving state-of-the-art results thanks to their ability to contextualize information using the self-attention and co-attention…

Computer Vision and Pattern Recognition · Computer Science 2021-03-30 Hila Chefer , Shir Gur , Lior Wolf

Leaner Transformers: More Heads, Less Depth

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means…

Machine Learning · Computer Science 2025-05-28 Hemanth Saratchandran , Damien Teney , Simon Lucey

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang , Yaming Yang , Jiangang Bai , Mingliang Zhang , Jing Bai , Jing Yu , Ce Zhang , Gao Huang , Yunhai Tong