Related papers: Reformer: The Efficient Transformer

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

The NLP Task Effectiveness of Long-Range Transformers

Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have…

Computation and Language · Computer Science 2024-12-10 Guanghui Qin, Yukun Feng, Benjamin Van Durme

Efficient Transformers: A Survey

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example,…

Machine Learning · Computer Science 2022-03-15 Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the…

Machine Learning · Computer Science 2020-09-01 Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, François Fleuret

Generating Long Sequences with Sparse Transformers

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also…

Machine Learning · Computer Science 2019-04-25 Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

Leaner Transformers: More Heads, Less Depth

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means…

Machine Learning · Computer Science 2025-05-28 Hemanth Saratchandran, Damien Teney, Simon Lucey

Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer

Pretrained transformer models have demonstrated remarkable performance across various natural language processing tasks. These models leverage the attention mechanism to capture long- and short-range dependencies in the sequence. However,…

Computation and Language · Computer Science 2023-10-20 Qingru Zhang, Dhananjay Ram, Cole Hawkins, Sheng Zha, Tuo Zhao

Linformer: Self-Attention with Linear Complexity

Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences,…

Machine Learning · Computer Science 2020-06-16 Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques include sparse and linear…

Machine Learning · Computer Science 2022-08-02 Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Long-Short Transformer: Efficient Transformers for Language and Vision

Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Chen Zhu, Wei Ping, Chaowei Xiao, Mohammad Shoeybi, Tom Goldstein, Anima Anandkumar, Bryan Catanzaro

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network…

Computation and Language · Computer Science 2022-04-14 Qingyang Wu, Zhenzhong Lan, Kun Qian, Jing Gu, Alborz Geramifard, Zhou Yu

Transformers predicting the future. Applying attention in next-frame and time series forecasting

Recurrent Neural Networks were, until recently, one of the best ways to capture the timely dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention-mechanisms…

Machine Learning · Computer Science 2021-08-19 Radostin Cholakov, Todor Kolev

Sub-Linear Memory: How to Make Performers SLiM

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring…

Machine Learning · Computer Science 2020-12-22 Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller

Dynamic Query Selection for Fast Visual Perceiver

Transformers have been matching deep convolutional networks for vision architectures in recent works. Most work is focused on getting the best results on large-scale benchmarks, and scaling laws seem to be the most successful strategy:…

Computer Vision and Pattern Recognition · Computer Science 2023-03-23 Corentin Dancette, Matthieu Cord

Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many…

Machine Learning · Computer Science 2023-10-31 Stefano Massaroli, Michael Poli, Daniel Y. Fu, Hermann Kumbong, Rom N. Parnichkun, Aman Timalsina, David W. Romero, Quinn McIntyre, Beidi Chen, Atri Rudra, Ce Zhang, Christopher Ré, Stefano Ermon, Yoshua Bengio

When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment

Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term…

Machine Learning · Computer Science 2023-11-06 Tianwei Ni, Michel Ma, Benjamin Eysenbach, Pierre-Luc Bacon

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability…

Machine Learning · Computer Science 2021-03-30 Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, Wancai Zhang

RecurFormer: Not All Transformer Heads Need Self-Attention

Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism's memory overhead. We observe…

Computation and Language · Computer Science 2024-10-18 Ruiqing Yan, Linghan Zheng, Xingbo Du, Han Zou, Yufeng Guo, Jianfei Yang

Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers

As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model's pre-trained context limits, has emerged as a critical challenge. While existing approaches to…

Computation and Language · Computer Science 2025-11-18 Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati