Related papers: Provably learning a multi-head attention layer

200 papers

We study the problem of learning Transformer-based sequence models with black-box access to their outputs. In this setting, a learner may adaptively query the oracle with any sequence of vectors and observe the output of the target…

Machine Learning · Computer Science 2026-05-05 Satwik Bhattamishra , Kulin Shah , Michael Hahn , Varun Kanade

Attention layers -- which map a sequence of inputs to a sequence of outputs -- are core building blocks of the Transformer architecture which has achieved significant breakthroughs in modern artificial intelligence. This paper presents a…

Machine Learning · Computer Science 2023-07-24 Hengyu Fu , Tianyu Guo , Yu Bai , Song Mei

Transformers have become the go-to architecture for language and vision tasks, yet their theoretical properties, especially memorization capacity, remain elusive. This paper investigates the memorization abilities of multi-head attention…

Machine Learning · Computer Science 2024-03-05 Sadegh Mahdavi , Renjie Liao , Christos Thrampoulidis

Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models allowed significant improvement in both fields, but once trained,…

Machine Learning · Computer Science 2021-05-21 Jean-Baptiste Cordonnier , Andreas Loukas , Martin Jaggi

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the…

Computer Vision and Pattern Recognition · Computer Science 2024-06-06 Wayner Barrios , SouYoung Jin

The Transformer translation model is based on the multi-head attention mechanism, which can be parallelized easily. The multi-head attention network performs the scaled dot-product attention function in parallel, empowering the model by…

Computation and Language · Computer Science 2021-09-13 Hongfei Xu , Qiuhui Liu , Josef van Genabith , Deyi Xiong

The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input…

Computational Complexity · Computer Science 2026-03-13 Barna Saha , Yinzhan Xu , Christopher Ye , Hantao Yu

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns:…

Machine Learning · Computer Science 2025-05-29 Jianliang He , Xintian Pan , Siyu Chen , Zhuoran Yang

Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our…

Machine Learning · Computer Science 2025-10-27 Morris Yau , Ekin Akyürek , Jiayuan Mao , Joshua B. Tenenbaum , Stefanie Jegelka , Jacob Andreas

The Transformer architecture has become ubiquitous in the natural language processing field. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi , Tatsuki Kuribayashi , Sho Yokoi , Kentaro Inui

While Transformer networks benefit from a global receptive field, their quadratic cost relative to sequence length restricts their application to long sequences and high-resolution inputs. We introduce Fast Multipole Attention (FMA), a…

Computation and Language · Computer Science 2025-09-19 Yanming Kang , Giang Tran , Hans De Sterck

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their…

Machine Learning · Computer Science 2023-08-02 Yihe Dong , Jean-Baptiste Cordonnier , Andreas Loukas

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, their direct application to speech tasks is not trivial. The nature of these sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant , Gerard I. Gállego , Belen Alastruey , Marta R. Costa-Jussà

We present a theoretical analysis of the performance of transformers with softmax attention in in-context learning with linear regression tasks. While the existing literature predominantly focuses on the convergence of transformers with…

Machine Learning · Computer Science 2024-02-01 Yingqian Cui , Jie Ren , Pengfei He , Jiliang Tang , Yue Xing

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to…

Neural and Evolutionary Computing · Computer Science 2019-11-07 Noam Shazeer

In-context learning (ICL) is a cornerstone of large language model (LLM) functionality, yet its theoretical foundations remain elusive due to the complexity of transformer architectures. In particular, most existing work only theoretically…

Machine Learning · Computer Science 2024-09-18 Siyu Chen , Heejune Sheen , Tianhao Wang , Zhuoran Yang

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis , Ramin Khalili , Iordanis Koutsopoulos

Attention is a powerful and ubiquitous mechanism for allowing neural models to focus on particular salient pieces of information by taking their weighted average when making predictions. In particular, multi-headed attention is a driving…

Computation and Language · Computer Science 2019-11-05 Paul Michel , Omer Levy , Graham Neubig

In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers, with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index…

Machine Learning · Computer Science 2025-11-13 Emanuele Troiani , Hugo Cui , Yatin Dandi , Florent Krzakala , Lenka Zdeborová

The Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang , Yaming Yang , Jiangang Bai , Mingliang Zhang , Jing Bai , Jing Yu , Ce Zhang , Gao Huang , Yunhai Tong