Related papers: Sumformer: Universal Approximation for Efficient T…

Are Transformers universal approximators of sequence-to-sequence functions?

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation…

Machine Learning · Computer Science 2020-02-26 Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

Transformers predicting the future. Applying attention in next-frame and time series forecasting

Recurrent Neural Networks were, until recently, one of the best ways to capture the temporal dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention mechanisms…

Machine Learning · Computer Science 2021-08-19 Radostin Cholakov, Todor Kolev

Dynamic Universal Approximation Theory: The Basic Theory for Transformer-based Large Language Models

Language models have emerged as a critical area of focus in artificial intelligence, particularly with the introduction of groundbreaking innovations like ChatGPT. Large-scale Transformer networks have quickly become the leading approach…

Artificial Intelligence · Computer Science 2024-12-12 Wei Wang, Qing Li

Universal Transformers

Recurrent neural networks (RNNs) sequentially process data by updating their state with each new data point, and have long been the de facto choice for sequence modeling tasks. However, their inherently sequential computation makes them…

Computation and Language · Computer Science 2019-03-06 Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, Łukasz Kaiser

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Universal Approximation Theorem for a Single-Layer Transformer

Deep learning employs multi-layer neural networks trained via the backpropagation algorithm. This approach has achieved success across many domains and relies on adaptive gradient methods such as the Adam optimizer. Sequence modeling…

Machine Learning · Computer Science 2025-07-16 Esmail Gumaan

Unlimiformer: Long-Range Transformers with Unlimited Length Input

Since the proposal of transformers, these models have been limited to bounded input lengths, because of their need to attend to every token in the input. In this work, we propose Unlimiformer: a general approach that wraps any existing…

Computation and Language · Computer Science 2023-11-01 Amanda Bertsch, Uri Alon, Graham Neubig, Matthew R. Gormley

Efficient Transformers: A Survey

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example,…

Machine Learning · Computer Science 2022-03-15 Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

Fastformer: Additive Attention Can Be All You Need

Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on…

Computation and Language · Computer Science 2021-09-07 Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, Xing Xie

TeamFormer: Shallow Parallel Transformers with Progressive Approximation

The widespread 'deeper is better' philosophy has driven the creation of architectures like ResNet and Transformer, which achieve high performance by stacking numerous layers. However, increasing model depth comes with challenges such as…

Machine Learning · Computer Science 2026-02-25 Wei Wang, Xiao-Yong Wei, Qing Li

Longformer: The Long-Document Transformer

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism…

Computation and Language · Computer Science 2020-12-03 Iz Beltagy, Matthew E. Peters, Arman Cohan

Rethinking Transformer Connectivity: TLinFormer, A Path to Exact, Full Context-Aware Linear Attention

The Transformer architecture has become a cornerstone of modern artificial intelligence, but its core self-attention mechanism suffers from a complexity bottleneck that scales quadratically with sequence length, severely limiting its…

Machine Learning · Computer Science 2025-08-29 Zhongpan Tang

Transformers are Expressive, But Are They Expressive Enough for Regression?

Transformers have become pivotal in Natural Language Processing, demonstrating remarkable success in applications like Machine Translation and Summarization. Given their widespread adoption, several works have attempted to analyze the…

Machine Learning · Computer Science 2024-09-02 Swaroop Nath, Harshad Khadilkar, Pushpak Bhattacharyya

Transformer-Transducer: End-to-End Speech Recognition with Self-Attention

We explore options to use Transformer networks in a neural transducer for end-to-end speech recognition. Transformer networks use self-attention for sequence modeling and come with advantages in parallel computation and capturing contexts.…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-30 Ching-Feng Yeh, Jay Mahadeokar, Kaustubh Kalgaonkar, Yongqiang Wang, Duc Le, Mahaveer Jain, Kjell Schubert, Christian Fuegen, Michael L. Seltzer

Two Steps Forward and One Behind: Rethinking Time Series Forecasting with Deep Learning

The Transformer is a highly successful deep learning model that has revolutionised the world of artificial neural networks, first in natural language processing and later in computer vision. This model is based on the attention mechanism…

Machine Learning · Computer Science 2023-05-09 Riccardo Ughi, Eugenio Lomurno, Matteo Matteucci

Langformers: Unified NLP Pipelines for Language Models

Transformer-based language models have revolutionized the field of natural language processing (NLP). However, using these models often involves navigating multiple frameworks and tools, as well as writing repetitive boilerplate code. This…

Computation and Language · Computer Science 2025-04-15 Rabindra Lamsal, Maria Rodriguez Read, Shanika Karunasekera

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To…

Computation and Language · Computer Science 2023-07-20 Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei

Sequence Complementor: Complementing Transformers For Time Series Forecasting with Learnable Sequences

Since its introduction, the transformer has shifted the development trajectory away from traditional models (e.g., RNN, MLP) in time series forecasting, which is attributed to its ability to capture global dependencies within temporal…

Machine Learning · Computer Science 2025-01-07 Xiwen Chen, Peijie Qiu, Wenhui Zhu, Huayu Li, Hao Wang, Aristeidis Sotiras, Yalin Wang, Abolfazl Razi

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

LeaPformer: Enabling Linear Transformers for Autoregressive and Simultaneous Tasks via Learned Proportions

A promising approach to preserving model performance in linearized transformers is to employ position-based re-weighting functions. However, state-of-the-art re-weighting functions rely heavily on target sequence lengths, making it…

Computation and Language · Computer Science 2024-05-24 Victor Agostinelli, Sanghyun Hong, Lizhong Chen