Related papers: Concise One-Layer Transformers Can Do Function Eva…

200 papers

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is…

Machine Learning · Computer Science · 2024-02-15 · Clayton Sanford, Daniel Hsu, Matus Telgarsky

Which transformer scaling regimes are able to perfectly solve different classes of algorithmic problems? While tremendous empirical advances have been attained by transformer-based neural networks, a theoretical understanding of their…

Transformers serve as the foundational architecture for large language and video generation models, such as GPT, BERT, SORA and their successors. Empirical studies have demonstrated that real-world data and learning tasks exhibit…

Machine Learning · Computer Science · 2026-05-19 · Zhaiming Shen, Alex Havrilla, Rongjie Lai, Alexander Cloninger, Wenjing Liao

One way to interpret the reasoning power of transformer-based language models is to describe the types of logical rules they can resolve over some input text. Recently, Chiang et al. (2023) showed that finite-precision transformers can be…

Machine Learning · Computer Science · 2025-09-12 · William Merrill, Ashish Sabharwal

Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to…

Machine Learning · Computer Science · 2025-11-07 · William Merrill, Ashish Sabharwal

We study the capabilities of the transformer architecture with varying depth. Specifically, we designed a novel set of sequence learning tasks to systematically evaluate and comprehend how the depth of transformer affects its ability to…

Machine Learning · Computer Science · 2024-04-03 · Xingwu Chen, Difan Zou

A simple communication complexity argument proves that no one-layer transformer can solve the induction heads task unless its size is exponentially larger than the size sufficient for a two-layer transformer.

Machine Learning · Computer Science · 2024-08-27 · Clayton Sanford, Daniel Hsu, Matus Telgarsky

Logical reasoning is central to complex human activities, such as thinking, debating, and planning; it is also a central component of many AI systems. In this paper, we investigate the extent to which encoder-only transformer…

Computation and Language · Computer Science · 2024-07-02 · Paulo Pirozelli, Marcos M. José, Paulo de Tarso P. Filho, Anarosa A. F. Brandão, Fabio G. Cozman

The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to…

Transformers, as the fundamental deep learning architecture, have demonstrated great capability in reasoning. This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and…

Computation and Language · Computer Science · 2025-07-11 · Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both…

Machine Learning · Computer Science · 2023-11-17 · Clayton Sanford, Daniel Hsu, Matus Telgarsky

Trained transformer models have been found to implement interpretable procedures for tasks like arithmetic and associative recall, but little is understood about how the circuits that implement these procedures originate during training. To…

Machine Learning · Computer Science · 2024-10-08 · Ziqian Zhong, Jacob Andreas

Transformers have become pivotal in Natural Language Processing, demonstrating remarkable success in applications like Machine Translation and Summarization. Given their widespread adoption, several works have attempted to analyze the…

Machine Learning · Computer Science · 2024-09-02 · Swaroop Nath, Harshad Khadilkar, Pushpak Bhattacharyya

Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional…

Machine Learning · Computer Science · 2022-12-13 · Yuxuan Li, James L. McClelland

Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for…

Machine Learning · Computer Science · 2024-07-02 · Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt

As transformers have gained prominence in natural language processing, some researchers have investigated theoretically what problems they can and cannot solve, by treating problems as formal languages. Exploring such questions can help…

Machine Learning · Computer Science · 2024-09-05 · Lena Strobl, William Merrill, Gail Weiss, David Chiang, Dana Angluin

The study on the expressive power of transformers shows that transformers are permutation equivariant, and they can approximate all permutation-equivariant continuous functions on a compact domain. However, these results are derived under…

Machine Learning · Computer Science · 2026-01-26 · Sejun Park, Yeachan Park, Geonho Hwang

Transformers, especially the decoder-only variants, are the backbone of most modern large language models; yet we do not have much understanding of their expressive power except for the simple $1$-layer case. Due to the difficulty of…

Machine Learning · Computer Science · 2024-12-05 · Lijie Chen, Binghui Peng, Hongxun Wu

Transformers have become the foundational architecture for a broad spectrum of sequence modeling applications, underpinning state-of-the-art systems in natural language processing, vision, and beyond. However, their theoretical limitations…

Computation and Language · Computer Science · 2026-02-13 · Michelle Yuan, Weiyi Sun, Amir H. Rezaeian, Jyotika Singh, Sandip Ghoshal, Yao-Ting Wang, Miguel Ballesteros, Yassine Benajiba

Counting properties (e.g. determining whether certain tokens occur more than other tokens in a given input text) have played a significant role in the study of expressiveness of transformers. In this paper, we provide a formal framework for…

Computation and Language · Computer Science · 2026-03-03 · Marco Sälzer, Chris Köcher, Alexander Kozachinskiy, Georg Zetzsche, Anthony Widjaja Lin