Related papers: Exact Sequence Interpolation with Transformers

Generating Long Sequences with Sparse Transformers

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also…

Machine Learning · Computer Science 2019-04-25 Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

The Effect of Attention Head Count on Transformer Approximation

Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of…

Machine Learning · Computer Science 2026-04-01 Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

Two Heads Are Better than One: Simulating Large Transformers with Small Ones

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input…

Machine Learning · Computer Science 2025-06-23 Hantao Yu, Josh Alman

Combiner: Full Attention Transformer with Sparse Computation Cost

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the…

Machine Learning · Computer Science 2021-10-29 Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai

From Interpolation to Extrapolation: Complete Length Generalization for Arithmetic Transformers

In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving…

Machine Learning · Computer Science 2024-05-13 Shaoxiong Duan, Yining Shi, Wei Xu

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the…

Machine Learning · Computer Science 2020-09-01 Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Measure-to-measure interpolation using Transformers

Transformers are deep neural network architectures that underpin the recent successes of large language models. Unlike more classical architectures that can be viewed as point-to-point maps, a Transformer acts as a measure-to-measure map…

Optimization and Control · Mathematics 2026-02-16 Borjan Geshkovski, Philippe Rigollet, Domènec Ruiz-Balet

Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input

Despite the great success of Transformer networks in various applications such as natural language processing and computer vision, their theoretical aspects are not well understood. In this paper, we study the approximation and estimation…

Machine Learning · Computer Science 2024-03-26 Shokichi Takakura, Taiji Suzuki

Transformers Are Universally Consistent

Despite their central role in the success of foundational models and large-scale language modeling, the theoretical foundations governing the operation of Transformers remain only partially understood. Contemporary research has largely…

Machine Learning · Computer Science 2025-06-02 Sagar Ghosh, Kushal Bose, Swagatam Das

Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation…

Machine Learning · Computer Science 2020-02-26 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Centroid Transformers: Learning to Abstract with Attention

Self-attention, as the key block of transformers, is a powerful mechanism for extracting features from the inputs. In essence, what self-attention does is to infer the pairwise relations between the elements of the inputs, and modify the…

Machine Learning · Computer Science 2021-03-09 Lemeng Wu, Xingchao Liu, Qiang Liu

Sampled Transformer for Point Sets

The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions. However, this permutation variant operation is…

Machine Learning · Computer Science 2023-03-01 Shidi Li, Christian Walder, Alexander Soen, Lexing Xie, Miaomiao Liu

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the…

Machine Learning · Computer Science 2024-10-31 Mingze Wang, Weinan E

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Music Transformer

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani…

Machine Learning · Computer Science 2018-12-13 Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, Douglas Eck

$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has…

Machine Learning · Computer Science 2020-12-22 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Linear Transformers Implicitly Discover Unified Numerical Algorithms

We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nyström…

Machine Learning · Computer Science 2025-09-25 Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for the language data, they quickly found their way to the image, video, graph, etc. data modalities with various…

Machine Learning · Computer Science 2025-09-22 Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida

Minimal Time Series Transformer

Transformer is the state-of-the-art model for many natural language processing, computer vision, and audio analysis problems. Transformer effectively combines information from the past input and output samples in auto-regressive manner so…

Machine Learning · Computer Science 2025-03-14 Joni-Kristian Kämäräinen

Approximation of Permutation Invariant Polynomials by Transformers: Efficient Construction in Column-Size

Transformers are a type of neural network that have demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of…

Machine Learning · Computer Science 2025-02-18 Naoki Takeshita, Masaaki Imaizumi