Related papers: Extended Mind Transformers

Memorizing Transformers

Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus…

Machine Learning · Computer Science 2022-03-18 Yuhuai Wu , Markus N. Rabe , DeLesley Hutchins , Christian Szegedy

Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Enhanced Model Architectures

Memory is fundamental to intelligence, enabling learning, reasoning, and adaptability across biological and artificial systems. While Transformer architectures excel at sequence modeling, they face critical limitations in long-range context…

Machine Learning · Computer Science 2025-08-19 Parsa Omidi , Xingshuai Huang , Axel Laborieux , Bahareh Nikpour , Tianyu Shi , Armaghan Eshaghi

Memory Efficient Continual Learning with Transformers

In many real-world scenarios, data to train machine learning models becomes available over time. Unfortunately, these models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is…

Computation and Language · Computer Science 2023-01-16 Beyza Ermis , Giovanni Zappella , Martin Wistuba , Aditya Rawal , Cedric Archambeau

Modifying Memories in Transformer Models

Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of…

Computation and Language · Computer Science 2020-12-02 Chen Zhu , Ankit Singh Rawat , Manzil Zaheer , Srinadh Bhojanapalli , Daliang Li , Felix Yu , Sanjiv Kumar

Memory in humans and deep language models: Linking hypotheses for model augmentation

The computational complexity of the self-attention mechanism in Transformer models significantly limits their ability to generalize over long temporal durations. Memory-augmentation, or the explicit storing of past information in external…

Computation and Language · Computer Science 2022-11-29 Omri Raccah , Phoebe Chen , Ted L. Willke , David Poeppel , Vy A. Vo

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev , Yuri Kuratov , Anton Peganov , Grigory V. Sapunov

Stateful Memory-Augmented Transformers for Efficient Dialogue Modeling

Transformer encoder-decoder models have achieved great performance in dialogue generation tasks, however, their inability to process long dialogue history often leads to truncation of the context To address this problem, we propose a novel…

Computation and Language · Computer Science 2023-05-24 Qingyang Wu , Zhou Yu

Pretraining with hierarchical memories: separating long-tail and common knowledge

The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a…

Computation and Language · Computer Science 2026-03-24 Hadi Pouransari , David Grangier , C Thomas , Michael Kirchhof , Oncel Tuzel

LaMemo: Language Modeling with Look-Ahead Memory

Although Transformers with fully connected self-attentions are powerful to model long-term dependencies, they are struggling to scale to long texts with thousands of words in language modeling. One of the solutions is to equip the model…

Computation and Language · Computer Science 2022-04-27 Haozhe Ji , Rongsheng Zhang , Zhenyu Yang , Zhipeng Hu , Minlie Huang

On Difficulties of Attention Factorization through Shared Memory

Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex…

Machine Learning · Computer Science 2024-04-02 Uladzislau Yorsh , Martin Holeňa , Ondřej Bojar , David Herel

An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks

Access to external knowledge is essential for many natural language processing tasks, such as question answering and dialogue. Existing methods often rely on a parametric model that stores knowledge in its parameters, or use a…

Computation and Language · Computer Science 2022-11-01 Yuxiang Wu , Yu Zhao , Baotian Hu , Pasquale Minervini , Pontus Stenetorp , Sebastian Riedel

Memformer: A Memory-Augmented Transformer for Sequence Modeling

Transformers have reached remarkable success in sequence modeling. However, these models have efficiency issues as they need to store all the history token-level representations as memory. We present Memformer, an efficient neural network…

Computation and Language · Computer Science 2022-04-14 Qingyang Wu , Zhenzhong Lan , Kun Qian , Jing Gu , Alborz Geramifard , Zhou Yu

Augmenting Pre-trained Language Models with QA-Memory for Open-Domain Question Answering

Retrieval augmented language models have recently become the standard for knowledge intensive tasks. Rather than relying purely on latent semantics within the parameters of large neural models, these methods enlist a semi-parametric memory…

Computation and Language · Computer Science 2023-01-24 Wenhu Chen , Pat Verga , Michiel de Jong , John Wieting , William Cohen

Efficient Transformers: A Survey

Transformer model architectures have garnered immense interest lately due to their effectiveness across a range of domains like language, vision and reinforcement learning. In the field of natural language processing for example,…

Machine Learning · Computer Science 2022-03-15 Yi Tay , Mostafa Dehghani , Dara Bahri , Donald Metzler

Human-like fleeting memory improves language learning but impairs reading time prediction in transformer language models

Human memory is fleeting. As words are processed, the exact wordforms that make up incoming sentences are rapidly lost. Cognitive scientists have long believed that this limitation of memory may, paradoxically, help in learning language -…

Computation and Language · Computer Science 2026-05-11 Abishek Thamma , Micha Heilbron

When Can Self-Attention Be Replaced by Feed Forward Layers?

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-29 Shucong Zhang , Erfan Loweimi , Peter Bell , Steve Renals

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier , Gaétan Marceau Caron , Daniel Aloise

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang , Yaming Yang , Jiangang Bai , Mingliang Zhang , Jing Bai , Jing Yu , Ce Zhang , Gao Huang , Yunhai Tong

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science 2023-10-25 Sunit Bhattacharya , Ondrej Bojar

Memory Limitations of Prompt Tuning in Transformers

Despite the empirical success of prompt tuning in adapting pretrained language models to new tasks, theoretical analyses of its capabilities remain limited. Existing theoretical work primarily addresses universal approximation properties,…

Machine Learning · Computer Science 2025-09-03 Maxime Meyer , Mario Michelessa , Caroline Chaux , Vincent Y. F. Tan