Related papers: Memorizing Transformers

Extended Mind Transformers

Pre-trained language models demonstrate general intelligence and common sense, but long inputs quickly become a bottleneck for memorizing information at inference time. We resurface a simple method, Memorizing Transformers (Wu et al.,…

Machine Learning · Computer Science 2024-06-05 Phoebe Klett, Thomas Ahle

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

Language Model Memory and Memory Models for Language

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically…

Computation and Language · Computer Science 2026-05-20 Benjamin L. Badger

Arithmetic with Language Models: from Memorization to Computation

A better understanding of the emergent computation and problem-solving capabilities of recent large language models is of paramount importance to further improve them and broaden their applicability. This work investigates how a language…

Artificial Intelligence · Computer Science 2024-08-05 Davide Maltoni, Matteo Ferrara

Fine-tuning Image Transformers using Learnable Memory

In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks.…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson

Efficient Nearest Neighbor Language Models

Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore, which allows them to learn through explicitly memorizing the training datapoints. While effective, these models often…

Computation and Language · Computer Science 2021-11-16 Junxian He, Graham Neubig, Taylor Berg-Kirkpatrick

Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping

Continual incorporation of new knowledge is essential for the long-term evolution of large language models (LLMs). Existing approaches typically rely on parameter-update algorithms to mitigate catastrophic forgetting, yet they suffer from…

Machine Learning · Computer Science 2026-05-07 Kaustubh Pethkar, Ziyang Xiong, Zuofeng Shang, Yingcong Li

Retrieval-Augmented Transformer for Image Captioning

Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Learning to Recall with Transformers Beyond Orthogonal Embeddings

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training…

Machine Learning · Statistics 2026-03-18 Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

How much do language models memorize?

We propose a new method for estimating how much a model knows about a datapoint and use it to measure the capacity of modern language models. Prior studies of language model memorization have struggled to disentangle memorization from…

Computation and Language · Computer Science 2025-06-19 John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, Saeed Mahloujifar

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science 2023-10-25 Sunit Bhattacharya, Ondrej Bojar

Pretraining with hierarchical memories: separating long-tail and common knowledge

The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a…

Computation and Language · Computer Science 2026-03-24 Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Language Modeling using LMUs: 10x Better Data Efficiency or Improved Scaling Compared to Transformers

Recent studies have demonstrated that the performance of transformers on the task of language modeling obeys a power-law relationship with model size over six orders of magnitude. While transformers exhibit impressive scaling, their…

Machine Learning · Computer Science 2021-10-07 Narsimha Chilkuri, Eric Hunsberger, Aaron Voelker, Gurshaant Malik, Chris Eliasmith

Modifying Memories in Transformer Models

Large Transformer models have achieved impressive performance in many natural language tasks. In particular, Transformer based language models have been shown to have great capabilities in encoding factual knowledge in their vast amount of…

Computation and Language · Computer Science 2020-12-02 Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar

Provable Benefits of In-Tool Learning for Large Language Models

Tool-augmented language models, equipped with retrieval, memory, or external APIs, are reshaping AI, yet their theoretical advantages remain underexplored. In this paper, we address this question by demonstrating the benefits of in-tool…

Machine Learning · Computer Science 2026-04-03 Sam Houliston, Ambroise Odonnat, Charles Arnal, Vivien Cabannes

Simple and Effective Input Reformulations for Translation

Foundation language models learn from their finetuning input context in different ways. In this paper, we reformulate inputs during finetuning for challenging translation tasks, leveraging model strengths from pretraining in novel ways to…

Computation and Language · Computer Science 2026-01-05 Brian Yu, Hansen Lillemark, Kurt Keutzer

Scaling Transformer to 1M tokens and beyond with RMT

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer…

Computation and Language · Computer Science 2024-02-07 Aydar Bulatov, Yuri Kuratov, Yermek Kapushev, Mikhail S. Burtsev

Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models

Despite their wide adoption, the underlying training and memorization dynamics of very large language models is not well understood. We empirically study exact memorization in causal and masked language modeling, across model sizes and…

Computation and Language · Computer Science 2022-11-04 Kushal Tirumala, Aram H. Markosyan, Luke Zettlemoyer, Armen Aghajanyan

Efficiently Adapting Pretrained Language Models To New Languages

Recent large language models (LLM) exhibit sub-optimal performance on low-resource languages, as the training data of these models is usually dominated by English and other high-resource languages. Furthermore, it is challenging to train…

Computation and Language · Computer Science 2023-12-18 Zoltan Csaki, Pian Pawakapan, Urmish Thakker, Qiantong Xu

Do Smaller Language Models Answer Contextualised Questions Through Memorisation Or Generalisation?

A distinction is often drawn between a model's ability to predict a label for an evaluation sample that is directly memorised from highly similar training samples versus an ability to predict the label via some method of generalisation. In…

Computation and Language · Computer Science 2023-11-22 Tim Hartill, Joshua Bensemann, Michael Witbrock, Patricia J. Riddle