Related papers: Do Transformers Need Deep Long-Range Memory

Recurrent Memory Transformer

Transformer-based models show their effectiveness across multiple domains and tasks. Self-attention combines information from all sequence elements into context-aware representations. However, global and local information has…

Computation and Language · Computer Science 2022-12-09 Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Transformers have a potential of learning longer-term dependency, but are limited by a fixed-length context in the setting of language modeling. We propose a novel neural architecture Transformer-XL that enables learning dependency beyond a…

Machine Learning · Computer Science 2019-06-04 Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V. Le, Ruslan Salakhutdinov

Efficient Transformers: A Survey

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example,…

Machine Learning · Computer Science 2022-03-15 Yi Tay, Mostafa Dehghani, Dara Bahri, Donald Metzler

X-Former: In-Memory Acceleration of Transformers

Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score for every word relative to other words in a sequence. However, these…

Machine Learning · Computer Science 2023-03-15 Shrihari Sridharan, Jacob R. Stevens, Kaushik Roy, Anand Raghunathan

Wide Attention Is The Way Forward For Transformers?

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

Exploring Transformers for Large-Scale Speech Recognition

While recurrent neural networks still largely define state-of-the-art speech recognition systems, the Transformer network has been proven to be a competitive alternative, especially in the offline condition. Most studies with Transformers…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-13 Liang Lu, Changliang Liu, Jinyu Li, Yifan Gong

Transformer-XL for Long Sequence Tasks in Robotic Learning from Demonstration

This paper presents an innovative application of Transformer-XL for long sequence tasks in robotic learning from demonstrations (LfD). The proposed framework effectively integrates multi-modal sensor inputs, including RGB-D images, LiDAR,…

Robotics · Computer Science 2025-12-16 Gao Tianci

Augmenting Self-attention with Persistent Memory

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long…

Machine Learning · Computer Science 2019-07-03 Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

The NLP Task Effectiveness of Long-Range Transformers

Transformer models cannot easily scale to long sequences due to their O(N^2) time and space complexity. This has led to Transformer variants seeking to lower computational complexity, such as Longformer and Performer. While such models have…

Computation and Language · Computer Science 2024-12-10 Guanghui Qin, Yukun Feng, Benjamin Van Durme

Sub-Linear Memory: How to Make Performers SLiM

The Transformer architecture has revolutionized deep learning on sequential data, becoming ubiquitous in state-of-the-art solutions for a wide variety of applications. Yet vanilla Transformers are notoriously resource-expensive, requiring…

Machine Learning · Computer Science 2020-12-22 Valerii Likhosherstov, Krzysztof Choromanski, Jared Davis, Xingyou Song, Adrian Weller

LongNet: Scaling Transformers to 1,000,000,000 Tokens

Scaling sequence length has become a critical demand in the era of large language models. However, existing methods struggle with either computational complexity or model expressivity, rendering the maximum sequence length restricted. To…

Computation and Language · Computer Science 2023-07-20 Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, Furu Wei

On Difficulties of Attention Factorization through Shared Memory

Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex…

Machine Learning · Computer Science 2024-04-02 Uladzislau Yorsh, Martin Holeňa, Ondřej Bojar, David Herel

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows the transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers

The Transformer architecture has significantly advanced deep learning, particularly in natural language processing, by effectively managing long-range dependencies. However, as the demand for understanding complex relationships grows,…

Computation and Language · Computer Science 2024-06-18 Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang

When Can Self-Attention Be Replaced by Feed Forward Layers?

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-29 Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Leaner Transformers: More Heads, Less Depth

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means…

Machine Learning · Computer Science 2025-05-28 Hemanth Saratchandran, Damien Teney, Simon Lucey

Reformer: The Efficient Transformer

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of…

Machine Learning · Computer Science 2020-02-19 Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

On the Long Range Abilities of Transformers

Despite their dominance in modern DL and, especially, NLP domains, transformer architectures exhibit sub-optimal performance on long-range tasks compared to recent layers that are specifically designed for this purpose. In this work,…

Machine Learning · Computer Science 2023-11-29 Itamar Zimerman, Lior Wolf

Transformer-based World Models Are Happy With 100k Interactions

Deep neural networks have been successful in many reinforcement learning settings. However, compared to human learners they are overly data hungry. To build a sample-efficient world model, we apply a transformer to real-world episodes in an…

Machine Learning · Computer Science 2023-03-14 Jan Robine, Marc Höftmann, Tobias Uelwer, Stefan Harmeling

TransformerFAM: Feedback attention is working memory

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages…

Machine Learning · Computer Science 2024-05-08 Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar