Related papers: Looped Transformers as Programmable Computers

Looped ReLU MLPs May Be All You Need as Practical Programmable Computers

Previous work has demonstrated that attention mechanisms are Turing complete. More recently, it has been shown that a looped 9-layer Transformer can function as a universal programmable computer. In contrast, the multi-layer perceptrons…

Machine Learning · Computer Science 2025-02-21 Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou

On the Existence of Universal Simulators of Attention

Previous work on the learnability of transformers, focused primarily on examining their ability to approximate specific algorithmic patterns through training, has largely been data-driven, offering only probabilistic…

Machine Learning · Computer Science 2026-04-23 Debanjan Dutta, Anish Chakrabarty, Faizanuddin Ansari, Swagatam Das

Thinking Like Transformers

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no…

Machine Learning · Computer Science 2021-07-20 Gail Weiss, Yoav Goldberg, Eran Yahav

Average Attention Transformers and Arithmetic Circuits

We analyse the computational power of transformer encoders as sequence-to-sequence functions on vectors. We show that average hard attention can be used to simulate arithmetic circuits if they are given as an input to an encoder. The…

Computational Complexity · Computer Science 2026-05-07 Lena Ehrmuth, Laura Strieker

Transformers, parallel computation, and logarithmic depth

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is…

Machine Learning · Computer Science 2024-02-15 Clayton Sanford, Daniel Hsu, Matus Telgarsky

Training Transformers as a Universal Computer

We demonstrate that a small transformer can learn to execute programs in MicroPy, a simplified yet computationally universal programming language. Given procedure definitions together with an expression to evaluate, the transformer predicts…

Artificial Intelligence · Computer Science 2026-04-29 Ruize Xu, Chenxiao Yang, Yanhong Li, David McAllester

Representational Strengths and Limitations of Transformers

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both…

Machine Learning · Computer Science 2023-11-17 Clayton Sanford, Daniel Hsu, Matus Telgarsky

Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation…

Machine Learning · Computer Science 2020-02-26 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Error Correction Code Transformer

Error correction code is a major part of the communication physical layer, ensuring the reliable transfer of data over noisy channels. Recently, neural decoders were shown to outperform classical decoding techniques. However, the existing…

Machine Learning · Computer Science 2022-03-30 Yoni Choukroun, Lior Wolf

What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

We study the capabilities of the transformer architecture with varying depth. Specifically, we designed a novel set of sequence learning tasks to systematically evaluate and comprehend how the depth of transformer affects its ability to…

Machine Learning · Computer Science 2024-04-03 Xingwu Chen, Difan Zou

Looped Transformers are Better at Learning Learning Algorithms

Transformers have demonstrated effectiveness in in-context solving data-fitting problems from various (latent) models, as reported by Garg et al. However, the absence of an inherent iterative structure in the transformer architecture…

Machine Learning · Computer Science 2024-03-19 Liu Yang, Kangwook Lee, Robert Nowak, Dimitris Papailiopoulos

Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)

While transformers have proven enormously successful in a range of tasks, their fundamental properties as models of computation are not well understood. This paper contributes to the study of the expressive capacity of transformers,…

Machine Learning · Computer Science 2025-03-31 Lena Strobl, Dana Angluin, Robert Frank

MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and…

Computation and Language · Computer Science 2026-02-04 Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

Small transformer architectures for task switching

The rapid progress seen in terms of large-scale generative AI is largely based on the attention mechanism. It is conversely non-trivial to conceive small-scale applications for which attention-based architectures outperform traditional…

Machine Learning · Computer Science 2025-08-07 Claudius Gros

Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers

This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model, a state-of-the-art architecture for sequence-to-sequence…

Computation and Language · Computer Science 2024-02-06 Vukasin Bozic, Danilo Dordevic, Daniele Coppola, Joseph Thommes, Sidak Pal Singh

Equivalent Linear Mappings of Large Language Models

Despite significant progress in transformer interpretability, an understanding of the computational mechanisms of large language models (LLMs) remains a fundamental challenge. Many approaches interpret a network's hidden representations but…

Machine Learning · Computer Science 2025-10-14 James R. Golden

Universal Approximation Theorem for a Single-Layer Transformer

Deep learning employs multi-layer neural networks trained via the backpropagation algorithm. This approach has achieved success across many domains and relies on adaptive gradient methods such as the Adam optimizer. Sequence modeling…

Machine Learning · Computer Science 2025-07-16 Esmail Gumaan

Quantum Vision Transformers

In this work, quantum transformers are designed and analysed in detail by extending the state-of-the-art classical transformer neural network architectures known to be very performant in natural language processing and image analysis.…

Quantum Physics · Physics 2024-02-28 El Amine Cherrat, Iordanis Kerenidis, Natansh Mathur, Jonas Landman, Martin Strahm, Yun Yvonna Li

Attention-Only Transformers and Implementing MLPs with Attention Heads

The transformer architecture is widely used in machine learning models and consists of two alternating sublayers: attention heads and MLPs. We prove that an MLP neuron can be implemented by a masked attention head with internal dimension 1…

Machine Learning · Computer Science 2023-09-18 Robert Huben, Valerie Morris

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong