Related papers: Provably learning a multi-head attention layer

Provably Learning Attention with Queries

We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target…

Machine Learning · Computer Science 2026-05-05 Satwik Bhattamishra, Kulin Shah, Michael Hahn, Varun Kanade

What can a Single Attention Layer Learn? A Study Through the Random Features Lens

Attention layers -- which map a sequence of inputs to a sequence of outputs -- are core building blocks of the Transformer architecture which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a…

Machine Learning · Computer Science 2023-07-24 Hengyu Fu, Tianyu Guo, Yu Bai, Song Mei

Memorization Capacity of Multi-Head Attention in Transformers

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention…

Machine Learning · Computer Science 2024-03-05 Sadegh Mahdavi, Renjie Liao, Christos Thrampoulidis

Multi-Head Attention: Collaborate Instead of Concatenate

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models allowed significant improvement in both fields, but once trained,…

Machine Learning · Computer Science 2021-05-21 Jean-Baptiste Cordonnier, Andreas Loukas, Martin Jaggi

Multi-layer Learnable Attention Mask for Multimodal Tasks

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the…

Computer Vision and Pattern Recognition · Computer Science 2024-06-06 Wayner Barrios, SouYoung Jin

Learning Hard Retrieval Decoder Attention for Transformers

The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by…

Computation and Language · Computer Science 2021-09-13 Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong

On the Computational Hardness of Transformers

The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input…

Computational Complexity · Computer Science 2026-03-13 Barna Saha, Yinzhan Xu, Christopher Ye, Hantao Yu

In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns:…

Machine Learning · Computer Science 2025-05-29 Jianliang He, Xintian Pan, Siyu Chen, Zhuoran Yang

Learning Linear Attention in Polynomial Time

Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our…

Machine Learning · Computer Science 2025-10-27 Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Fast Multipole Attention: A Scalable Multilevel Attention Mechanism for Text and Images

While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a…

Computation and Language · Computer Science 2025-09-19 Yanming Kang, Giang Tran, Hans De Sterck

Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their…

Machine Learning · Computer Science 2023-08-02 Yihe Dong, Jean-Baptiste Cordonnier, Andreas Loukas

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

Superiority of Multi-Head Attention in In-Context Linear Regression

We present a theoretical analysis of the performance of transformer with softmax attention in in-context learning with linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with…

Machine Learning · Computer Science 2024-02-01 Yingqian Cui, Jie Ren, Pengfei He, Jiliang Tang, Yue Xing

Fast Transformer Decoding: One Write-Head is All You Need

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to…

Neural and Evolutionary Computing · Computer Science 2019-11-07 Noam Shazeer

Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers

In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically…

Machine Learning · Computer Science 2024-09-18 Siyu Chen, Heejune Sheen, Tianhao Wang, Zhuoran Yang

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

Are Sixteen Heads Really Better than One?

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving…

Computation and Language · Computer Science 2019-11-05 Paul Michel, Omer Levy, Graham Neubig

Fundamental limits of learning in sequence multi-index models and deep attention networks: High-dimensional asymptotics and sharp thresholds

In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers, with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index…

Machine Learning · Computer Science 2025-11-13 Emanuele Troiani, Hugo Cui, Yatin Dandi, Florent Krzakala, Lenka Zdeborová

Evolving Attention with Residual Convolutions

The Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong