Related papers: Thinking Like Transformers

Transformers as Transducers

We study the sequence-to-sequence mapping capacity of transformers by relating them to finite transducers, and find that they can express surprisingly large classes of transductions. We do so using variants of RASP, a programming language…

Formal Languages and Automata Theory · Computer Science 2024-11-07 Lena Strobl, Dana Angluin, David Chiang, Jonathan Rawski, Ashish Sabharwal

Learning Transformer Programs

Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short…

Machine Learning · Computer Science 2023-11-01 Dan Friedman, Alexander Wettig, Danqi Chen

Discovering Interpretable Algorithms by Decompiling Transformers to RASP

Recent work has shown that the computations of Transformers can be simulated in the RASP family of programming languages. These findings have enabled improved understanding of the expressive capacity and generalization abilities of…

Machine Learning · Computer Science 2026-02-10 Xinting Huang, Aleksandra Bakalova, Satwik Bhattamishra, William Merrill, Michael Hahn

On the Existence of Universal Simulators of Attention

Previous work on the learnability of transformers, focused primarily on examining their ability to approximate specific algorithmic patterns through training, has largely been data-driven, offering only probabilistic…

Machine Learning · Computer Science 2026-04-23 Debanjan Dutta, Anish Chakrabarty, Faizanuddin Ansari, Swagatam Das

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

An Introduction to Transformers

The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points. The transformer has driven recent advances in natural language processing, computer vision, and…

Machine Learning · Computer Science 2026-01-21 Richard E. Turner

Transformers predicting the future. Applying attention in next-frame and time series forecasting

Recurrent Neural Networks were, until recently, one of the best ways to capture the timely dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention-mechanisms…

Machine Learning · Computer Science 2021-08-19 Radostin Cholakov, Todor Kolev

Understanding Transformers and Attention Mechanisms: An Introduction for Applied Mathematicians

This document provides a brief introduction to the attention mechanism used in modern language models based on the Transformer architecture. We first illustrate how text is encoded as vectors and how the attention mechanism processes these…

Numerical Analysis · Mathematics 2026-04-02 Michel Fabrice Serret

Transformers Learn Shortcuts to Automata

Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning…

Machine Learning · Computer Science 2023-05-03 Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, Cyril Zhang

Transformers as Graph-to-Graph Models

We argue that Transformers are essentially graph-to-graph models, with sequences just being a special case. Attention weights are functionally equivalent to graph edges. Our Graph-to-Graph Transformer architecture makes this ability…

Computation and Language · Computer Science 2023-10-30 James Henderson, Alireza Mohammadshahi, Andrei C. Coman, Lesly Miculicich

Transformer brain encoders explain human high-level visual responses

A major goal of neuroscience is to understand brain computations during visual processing in naturalistic settings. A dominant approach is to use image-computable deep neural networks trained with different task objectives as a basis for…

Neurons and Cognition · Quantitative Biology 2026-02-06 Hossein Adeli, Sun Minni, Nikolaus Kriegeskorte

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

Transformers, parallel computation, and logarithmic depth

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is…

Machine Learning · Computer Science 2024-02-15 Clayton Sanford, Daniel Hsu, Matus Telgarsky

Looped Transformers as Programmable Computers

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data…

Machine Learning · Computer Science 2023-01-31 Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

RealFormer: Transformer Likes Residual Attention

Transformer is the backbone of modern NLP models. In this paper, we propose RealFormer, a simple and generic technique to create Residual Attention Layer Transformer networks that significantly outperform the canonical Transformer and its…

Machine Learning · Computer Science 2021-09-14 Ruining He, Anirudh Ravula, Bhargav Kanagal, Joshua Ainslie

Attention as Binding: A Vector-Symbolic Perspective on Transformer Reasoning

Transformer-based language models display impressive reasoning-like behavior, yet remain brittle on tasks that require stable symbolic manipulation. This paper develops a unified perspective on these phenomena by interpreting self-attention…

Artificial Intelligence · Computer Science 2025-12-18 Sahil Rajesh Dhayalkar

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Addressing Some Limitations of Transformers with Feedback Memory

Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. Unlike recurrent neural networks, Transformers use attention to capture temporal relations while processing input tokens in…

Machine Learning · Computer Science 2021-01-26 Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, Sainbayar Sukhbaatar

RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals

Transformers have achieved great success in effectively processing sequential data such as text. Their architecture consisting of several attention and feedforward blocks can model relations between elements of a sequence in parallel…

Machine Learning · Computer Science 2025-02-20 Jaemu Heo, Eldor Fozilov, Hyunmin Song, Taehwan Kim

Neural Decompiling of Tracr Transformers

Recently, the transformer architecture has enabled substantial progress in many areas of pattern recognition and machine learning. However, as with other neural network models, there is currently no general method available to explain their…

Machine Learning · Computer Science 2024-12-02 Hannes Thurnherr, Kaspar Riesen