Related papers: Exact Sequence Interpolation with Transformers

200 papers

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also…

Machine Learning · Computer Science 2019-04-25 Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of…

Machine Learning · Computer Science 2026-04-01 Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input…

Machine Learning · Computer Science 2025-06-23 Hantao Yu, Josh Alman

Transformers provide a class of expressive architectures that are extremely effective for sequence modeling. However, the key limitation of transformers is their quadratic memory and time complexity $\mathcal{O}(L^2)$ with respect to the…

Machine Learning · Computer Science 2021-10-29 Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, Bo Dai

In this paper, we investigate the inherent capabilities of transformer models in learning arithmetic algorithms, such as addition and parity. Through experiments and attention analysis, we identify a number of crucial factors for achieving…

Machine Learning · Computer Science 2024-05-13 Shaoxiong Duan, Yining Shi, Wei Xu

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the…

Machine Learning · Computer Science 2020-09-01 Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Transformers are deep neural network architectures that underpin the recent successes of large language models. Unlike more classical architectures that can be viewed as point-to-point maps, a Transformer acts as a measure-to-measure map…

Optimization and Control · Mathematics 2026-02-16 Borjan Geshkovski, Philippe Rigollet, Domènec Ruiz-Balet

Despite the great success of Transformer networks in various applications such as natural language processing and computer vision, their theoretical aspects are not well understood. In this paper, we study the approximation and estimation…

Machine Learning · Computer Science 2024-03-26 Shokichi Takakura, Taiji Suzuki

Despite their central role in the success of foundational models and large-scale language modeling, the theoretical foundations governing the operation of Transformers remain only partially understood. Contemporary research has largely…

Machine Learning · Computer Science 2025-06-02 Sagar Ghosh, Kushal Bose, Swagatam Das

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation…

Machine Learning · Computer Science 2020-02-26 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Self-attention, as the key block of transformers, is a powerful mechanism for extracting features from the inputs. In essence, what self-attention does is to infer the pairwise relations between the elements of the inputs, and modify the…

Machine Learning · Computer Science 2021-03-09 Lemeng Wu, Xingchao Liu, Qiang Liu

The sparse transformer can reduce the computational complexity of the self-attention layers to $O(n)$, whilst still being a universal approximator of continuous sequence-to-sequence functions. However, this permutation variant operation is…

Machine Learning · Computer Science 2023-03-01 Shidi Li, Christian Walder, Alexander Soen, Lexing Xie, Miaomiao Liu

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the…

Machine Learning · Computer Science 2024-10-31 Mingze Wang, Weinan E

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. The Transformer (Vaswani…

Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has…

Machine Learning · Computer Science 2020-12-22 Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

We train a linear attention transformer on millions of masked-block matrix completion tasks: each prompt is a masked low-rank matrix whose missing block may be (i) a scalar prediction target or (ii) an unseen kernel slice of Nyström…

Machine Learning · Computer Science 2025-09-25 Patrick Lutz, Aditya Gangrade, Hadi Daneshmand, Venkatesh Saligrama

Transformers and their attention mechanism have been revolutionary in the field of Machine Learning. While originally proposed for the language data, they quickly found their way to the image, video, graph, etc. data modalities with various…

Machine Learning · Computer Science 2025-09-22 Saeed Amizadeh, Sara Abdali, Yinheng Li, Kazuhito Koishida

Transformer is the state-of-the-art model for many natural language processing, computer vision, and audio analysis problems. Transformer effectively combines information from the past input and output samples in auto-regressive manner so…

Machine Learning · Computer Science 2025-03-14 Joni-Kristian Kämäräinen

Transformers are a type of neural network that have demonstrated remarkable performance across various domains, particularly in natural language processing tasks. Motivated by this success, research on the theoretical understanding of…

Machine Learning · Computer Science 2025-02-18 Naoki Takeshita, Masaaki Imaizumi