Related papers: Global memory transformer for processing long docu…

GMAT: Global Memory Augmentation for Transformers

Transformer-based models have become ubiquitous in natural language processing thanks to their large capacity, innate parallelism and high performance. The contextualizing component of a Transformer block is the $\textit{pairwise…

Machine Learning · Computer Science 2020-06-08 Ankit Gupta, Jonathan Berant

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

Extended Mind Transformers

Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al.,…

Machine Learning · Computer Science 2024-06-05 Phoebe Klett, Thomas Ahle

Exploring Length Generalization in Large Language Models

The ability to extrapolate from short problem instances to longer ones is an important form of out-of-distribution generalization in reasoning tasks, and is crucial when learning from datasets where longer problem instances are rare. These…

Computation and Language · Computer Science 2022-11-15 Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, Behnam Neyshabur

Uncertainty Guided Global Memory Improves Multi-Hop Question Answering

Transformers have become the gold standard for many natural language processing tasks and, in particular, for multi-hop question answering (MHQA). This task includes processing a long document and reasoning over the multiple parts of it.…

Computation and Language · Computer Science 2023-12-01 Alsu Sagirova, Mikhail Burtsev

Transformer Working Memory Enables Regular Language Reasoning and Natural Language Length Extrapolation

Unlike recurrent models, conventional wisdom has it that Transformers cannot perfectly model regular languages. Inspired by the notion of working memory, we propose a new Transformer variant named RegularGPT. With its novel combination of…

Computation and Language · Computer Science 2023-05-09 Ta-Chung Chi, Ting-Han Fan, Alexander I. Rudnicky, Peter J. Ramadge

Think Before You Act: Decision Transformers with Working Memory

Decision Transformer-based decision-making agents have shown the ability to generalize across multiple tasks. However, their performance relies on massive data and computation. We argue that this inefficiency stems from the forgetting…

Machine Learning · Computer Science 2024-05-30 Jikun Kang, Romain Laroche, Xingdi Yuan, Adam Trischler, Xue Liu, Jie Fu

Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI

Pre-trained language models have recently emerged as a powerful tool for fine-tuning a variety of language tasks. Ideally, when models are pre-trained on large amounts of data, they are expected to gain implicit knowledge. In this paper, we…

Computation and Language · Computer Science 2023-06-22 Mohamad Ballout, Ulf Krumnack, Gunther Heidemann, Kai-Uwe Kühnberger

Memory Augmented Large Language Models are Computationally Universal

We show that transformer-based large language models are computationally universal when augmented with an external memory. Any deterministic language model that conditions on strings of bounded length is equivalent to a finite automaton,…

Computation and Language · Computer Science 2023-01-12 Dale Schuurmans

Scaling Transformer to 1M tokens and beyond with RMT

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer…

Computation and Language · Computer Science 2024-02-07 Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail S. Burtsev

LM2: Large Memory Models

This paper introduces the Large Memory Model (LM2), a decoder-only Transformer architecture enhanced with an auxiliary memory module that aims to address the limitations of standard Transformers in multi-step reasoning, relational…

Computation and Language · Computer Science 2025-02-11 Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, Andy Toulis

Gradual Forgetting: Logarithmic Compression for Extending Transformer Context Windows

Most approaches to long-context processing increase the complexity of the transformer's internal architecture by integrating mechanisms such as recurrence or auxiliary memory modules. In this work, we introduce an alternative approach that…

Computation and Language · Computer Science 2025-10-28 Billy Dickson, Zoran Tiganj

Structured Memory Mechanisms for Stable Context Representation in Large Language Models

This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information…

Computation and Language · Computer Science 2025-05-30 Yue Xing, Tao Yang, Yijiashun Qi, Minggu Wei, Yu Cheng, Honghui Xin

Transformer with Memory Replay

Transformers achieve state-of-the-art performance for natural language processing tasks by pre-training on large-scale text corpora. They are extremely compute-intensive and have very high sample complexity. Memory replay is a mechanism…

Machine Learning · Computer Science 2022-05-23 Rui Liu, Barzan Mozafari

Transformers and Slot Encoding for Sample Efficient Physical World Modelling

World modelling, i.e. building a representation of the rules that govern the world so as to predict its evolution, is an essential ability for any agent interacting with the physical world. Recent applications of the Transformer…

Machine Learning · Computer Science 2024-05-31 Francesco Petri, Luigi Asprino, Aldo Gangemi

Memory-Augmented Generative Adversarial Transformers

Conversational AI systems that rely on Large Language Models, like Transformers, have difficulty interweaving external data (like facts) with the language they generate. Vanilla Transformer architectures are not designed for answering…

Computation and Language · Computer Science 2024-03-01 Stephan Raaijmakers, Roos Bakker, Anita Cremers, Roy de Kleijn, Tom Kouwenhoven, Tessa Verhoef

On Memory: A comparison of memory mechanisms in world models

World models enable agents to plan within imagined environments by predicting future states conditioned on past observations and actions. However, their ability to plan over long horizons is limited by the effective memory span of the…

Artificial Intelligence · Computer Science 2025-12-09 Eli J. Laird, Corey Clark

Adaptive Semiparametric Language Models

We present a language model that combines a large parametric neural network (i.e., a transformer) with a non-parametric episodic memory component in an integrated architecture. Our model uses extended short-term context by caching local…

Computation and Language · Computer Science 2021-02-05 Dani Yogatama, Cyprien de Masson d'Autume, Lingpeng Kong

Global-to-local Memory Pointer Networks for Task-Oriented Dialogue

End-to-end task-oriented dialogue is challenging since knowledge bases are usually large, dynamic and hard to incorporate into a learning framework. We propose the global-to-local memory pointer (GLMP) networks to address this issue. In our…

Computation and Language · Computer Science 2019-04-01 Chien-Sheng Wu, Richard Socher, Caiming Xiong

Pretraining with hierarchical memories: separating long-tail and common knowledge

The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a…

Computation and Language · Computer Science 2026-03-24 Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel