Related papers: Transformers, parallel computation, and logarithmi…

The Parallelism Tradeoff: Limitations of Log-Precision Transformers

Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input…

Computational Complexity · Computer Science 2023-04-28 William Merrill, Ashish Sabharwal

What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

We study the capabilities of the transformer architecture with varying depth. Specifically, we designed a novel set of sequence learning tasks to systematically evaluate and comprehend how the depth of transformer affects its ability to…

Machine Learning · Computer Science 2024-04-03 Xingwu Chen, Difan Zou

Understanding Transformer Reasoning Capabilities via Graph Algorithms

Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their…

Machine Learning · Computer Science 2024-05-30 Clayton Sanford, Bahare Fatemi, Ethan Hall, Anton Tsitsulin, Mehran Kazemi, Jonathan Halcrow, Bryan Perozzi, Vahab Mirrokni

Concise One-Layer Transformers Can Do Function Evaluation (Sometimes)

While transformers have proven enormously successful in a range of tasks, their fundamental properties as models of computation are not well understood. This paper contributes to the study of the expressive capacity of transformers,…

Machine Learning · Computer Science 2025-03-31 Lena Strobl, Dana Angluin, Robert Frank

Leaner Transformers: More Heads, Less Depth

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means…

Machine Learning · Computer Science 2025-05-28 Hemanth Saratchandran, Damien Teney, Simon Lucey

Linear Transformers are Versatile In-Context Learners

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in…

Machine Learning · Computer Science 2024-10-31 Max Vladymyrov, Johannes von Oswald, Mark Sandler, Rong Ge

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer…

Machine Learning · Computer Science 2026-04-24 Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, Sham Kakade

Representational Strengths and Limitations of Transformers

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both…

Machine Learning · Computer Science 2023-11-17 Clayton Sanford, Daniel Hsu, Matus Telgarsky

Fast attention mechanisms: a tale of parallelism

Transformers have the representational capacity to simulate Massively Parallel Computation (MPC) algorithms, but they suffer from quadratic time complexity, which severely limits their scalability. We introduce an efficient attention…

Machine Learning · Computer Science 2025-09-12 Jingwen Liu, Hantao Yu, Clayton Sanford, Alexandr Andoni, Daniel Hsu

Parallelizing Quantum Circuits

We present a novel automated technique for parallelizing quantum circuits via forward and backward translation to measurement-based quantum computing patterns and analyze the trade off in terms of depth and space complexity. As a result we…

Quantum Physics · Physics 2012-02-22 Anne Broadbent, Elham Kashefi

Thinking Like Transformers

What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no…

Machine Learning · Computer Science 2021-07-20 Gail Weiss, Yoav Goldberg, Eran Yahav

Some Notes on Parallel Quantum Computation

We exhibit some simple gadgets useful in designing shallow parallel circuits for quantum algorithms. We prove that any quantum circuit composed entirely of controlled-not gates or of diagonal gates can be parallelized to logarithmic depth,…

Quantum Physics · Physics 2009-09-25 Cristopher Moore, Martin Nilsson

Multihead self-attention in cortico-thalamic circuits

Both biological cortico-thalamic networks and artificial transformer networks use canonical computations to perform a wide range of cognitive tasks. In this work, we propose that the structure of cortico-thalamic circuits is well suited to…

Neurons and Cognition · Quantitative Biology 2025-08-12 Arno Granier, Walter Senn

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Looped Transformers as Programmable Computers

We present a framework for using transformer networks as universal computers by programming them with specific weights and placing them in a loop. Our input sequence acts as a punchcard, consisting of instructions and memory for data…

Machine Learning · Computer Science 2023-01-31 Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D. Lee, Dimitris Papailiopoulos

Thinking Deeper, Not Longer: Depth-Recurrent Transformers for Compositional Generalization

Standard Transformers have a fixed computational depth, fundamentally limiting their ability to generalize to tasks requiring variable-depth reasoning, such as multi-hop graph traversal or nested logic. We propose a depth-recurrent…

Machine Learning · Computer Science 2026-03-24 Hung-Hsuan Chen

Efficient Parallel Samplers for Recurrent-Depth Models and Their Connection to Diffusion Language Models

Language models with recurrent depth, also referred to as universal or looped when considering transformers, are defined by the capacity to increase their computation through the repetition of layers. Recent efforts in pretraining have…

Machine Learning · Computer Science 2025-10-17 Jonas Geiping, Xinyu Yang, Guinan Su

Parallel Training of Deep Networks with Local Updates

Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times…

Machine Learning · Computer Science 2021-06-16 Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive…

Machine Learning · Computer Science 2025-06-03 Yifan Hao, Chenlu Ye, Chi Han, Tong Zhang

In-Context Learning of Linear Dynamical Systems with Transformers: Approximation Bounds and Depth-Separation

This paper investigates approximation-theoretic aspects of the in-context learning capability of the transformers in representing a family of noisy linear dynamical systems. Our first theoretical result establishes an upper bound on the…

Machine Learning · Computer Science 2025-10-22 Frank Cole, Yuxuan Zhao, Yulong Lu, Tianhao Zhang