Related papers: Memory Layers at Scale

Large Memory Layers with Product Keys

This paper introduces a structured memory which can be easily integrated into a neural network. The memory is very large by design and significantly increases the capacity of the architecture, by up to a billion parameters with a negligible…

Computation and Language · Computer Science 2019-12-17 Guillaume Lample, Alexandre Sablayrolles, Marc'Aurelio Ranzato, Ludovic Denoyer, Hervé Jégou

Mixture of Chapters: Scaling Learnt Memory in Transformers

Transformers lack an explicit architectural mechanism for storing and organizing knowledge acquired during training. We introduce learnable sparse memory banks: a set of latent tokens, randomly initialized and trained end-to-end, that…

Machine Learning · Computer Science 2026-03-24 Tasmay Pankaj Tibrewal, Pritish Saha, Ankit Meda, Kunal Singh, Pradeep Moturi

Multi-scale Transformer Language Models

We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments…

Computation and Language · Computer Science 2020-05-05 Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau

LESA: Learnable LLM Layer Scaling-Up

Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger…

Machine Learning · Computer Science 2025-02-20 Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao

Pretraining with hierarchical memories: separating long-tail and common knowledge

The impressive performance gains of modern language models currently rely on scaling parameters: larger models store more world knowledge and reason better. Yet compressing all world knowledge into parameters is unnecessary, as only a…

Computation and Language · Computer Science 2026-03-24 Hadi Pouransari, David Grangier, C Thomas, Michael Kirchhof, Oncel Tuzel

Transformer Feed-Forward Layers Are Key-Value Memories

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where…

Computation and Language · Computer Science 2021-09-07 Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy

Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking

Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by…

Machine Learning · Computer Science 2025-05-30 Athanasios Glentis, Jiaxiang Li, Qiulin Shang, Andi Han, Ioannis Tsaknakis, Quan Wei, Mingyi Hong

The Power of External Memory in Increasing Predictive Model Capacity

One way of introducing sparsity into deep networks is by attaching an external table of parameters that is sparsely looked up at different layers of the network. By storing the bulk of the parameters in the external table, one can increase…

Machine Learning · Computer Science 2023-02-02 Cenk Baykal, Dylan J Cutler, Nishanth Dikkala, Nikhil Ghosh, Rina Panigrahy, Xin Wang

Beyond Memorization: Extending Reasoning Depth with Recurrence, Memory and Test-Time Compute Scaling

Reasoning is a core capability of large language models, yet how multi-step reasoning is learned and executed remains unclear. We study this question in a controlled cellular-automata (1dCA) framework that excludes memorisation by using…

Machine Learning · Computer Science 2026-05-08 Ivan Rodkin, Daniil Orel, Konstantin Smirnov, Arman Bolatov, Bilal Elbouardi, Besher Hassan, Yuri Kuratov, Aydar Bulatov, Preslav Nakov, Timothy Baldwin, Artem Shelmanov, Mikhail Burtsev

Continual Learning via Sparse Memory Finetuning

Modern language models are powerful, but typically static after deployment. A major obstacle to building models that continually learn over time is catastrophic forgetting, where updating on new data erases previously acquired capabilities.…

Computation and Language · Computer Science 2025-10-20 Jessy Lin, Luke Zettlemoyer, Gargi Ghosh, Wen-Tau Yih, Aram Markosyan, Vincent-Pierre Berges, Barlas Oğuz

Sparse Layers are Critical to Scaling Looped Language Models

Looped language models repeat a set of transformer layers through depth, reducing memory costs and providing natural early-exit points at loop boundaries. However, looped models do not scale as favorably as standard transformers with unique…

Machine Learning · Computer Science 2026-05-12 Ryan Lee, Jacob Biloki, Edward J. Hu, Jonathan May

Fine-tuning Image Transformers using Learnable Memory

In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks.…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson

Multi-level Memory for Task Oriented Dialogs

Recent end-to-end task oriented dialog systems use memory architectures to incorporate external knowledge in their dialogs. Current work makes simplifying assumptions about the structure of the knowledge base, such as the use of triples to…

Computation and Language · Computer Science 2020-09-30 Revanth Reddy, Danish Contractor, Dinesh Raghu, Sachindra Joshi

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This…

Machine Learning · Computer Science 2025-02-18 Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, Tom Goldstein

Memorizing Transformers

Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus…

Machine Learning · Computer Science 2022-03-18 Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy

Learning Features with Parameter-Free Layers

Trainable layers such as convolutional building blocks are the standard network design choices by learning parameters to capture the global context through successive spatial operations. When designing an efficient network, trainable layers…

Computer Vision and Pattern Recognition · Computer Science 2022-03-22 Dongyoon Han, YoungJoon Yoo, Beomyoung Kim, Byeongho Heo

Language Model Memory and Memory Models for Language

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically…

Computation and Language · Computer Science 2026-05-20 Benjamin L. Badger

Hash Layers For Large Sparse Models

We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on…

Machine Learning · Computer Science 2021-07-21 Stephen Roller, Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston

Scaling Laws for Associative Memories

Learning arguably involves the discovery and memorization of abstract rules. The aim of this paper is to study associative memory mechanisms. Our model is based on high-dimensional matrices consisting of outer products of embeddings, which…

Machine Learning · Statistics 2024-02-22 Vivien Cabannes, Elvis Dohmatob, Alberto Bietti

Memory capacity of neural network models

Memory is a complex phenomenon that involves several distinct mechanisms. These mechanisms operate at different spatial and temporal levels. This chapter focuses on the theoretical framework and the mathematical models that have been…

Neurons and Cognition · Quantitative Biology 2021-12-22 Stefano Fusi