Related papers: An Evolved Universal Transformer Memory

Neural Attention Memory

We propose a novel perspective of the attention mechanism by reinventing it as a memory architecture for neural networks, namely Neural Attention Memory (NAM). NAM is a memory structure that is both readable and writable via differentiable…

Machine Learning · Computer Science 2023-10-17 Hyoungwook Nam , Seung Byum Seo

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev , Yuri Kuratov , Anton Peganov , Grigory V. Sapunov

Memory-Augmented Transformers: A Systematic Review from Neuroscience Principles to Enhanced Model Architectures

Memory is fundamental to intelligence, enabling learning, reasoning, and adaptability across biological and artificial systems. While Transformer architectures excel at sequence modeling, they face critical limitations in long-range context…

Machine Learning · Computer Science 2025-08-19 Parsa Omidi , Xingshuai Huang , Axel Laborieux , Bahareh Nikpour , Tianyu Shi , Armaghan Eshaghi

One-shot Learning with Memory-Augmented Neural Networks

Despite recent breakthroughs in the applications of deep neural networks, one setting that presents a persistent challenge is that of "one-shot learning." Traditional gradient-based networks require a lot of data to learn, often through…

Machine Learning · Computer Science 2016-05-20 Adam Santoro , Sergey Bartunov , Matthew Botvinick , Daan Wierstra , Timothy Lillicrap

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by…

Machine Learning · Computer Science 2020-01-01 Thomas Dowdell , Hongyu Zhang

NAM: Normalization-based Attention Module

Recognizing less salient features is the key for model compression. However, it has not been investigated in the revolutionary attention mechanisms. In this work, we propose a novel normalization-based attention module (NAM), which…

Computer Vision and Pattern Recognition · Computer Science 2021-11-25 Yichao Liu , Zongru Shao , Yueyang Teng , Nico Hoffmann

ABC: Attention with Bounded-memory Control

Transformer architectures have achieved state-of-the-art results on a variety of sequence modeling tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead…

Computation and Language · Computer Science 2022-06-03 Hao Peng , Jungo Kasai , Nikolaos Pappas , Dani Yogatama , Zhaofeng Wu , Lingpeng Kong , Roy Schwartz , Noah A. Smith

On Difficulties of Attention Factorization through Shared Memory

Transformers have revolutionized deep learning in numerous fields, including natural language processing, computer vision, and audio processing. Their strength lies in their attention mechanism, which allows for the discovering of complex…

Machine Learning · Computer Science 2024-04-02 Uladzislau Yorsh , Martin Holeňa , Ondřej Bojar , David Herel

TransformerFAM: Feedback attention is working memory

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages…

Machine Learning · Computer Science 2024-05-08 Dongseong Hwang , Weiran Wang , Zhuoyuan Huo , Khe Chai Sim , Pedro Moreno Mengibar

Selective Attention Improves Transformer

Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention…

Computation and Language · Computer Science 2025-04-25 Yaniv Leviathan , Matan Kalman , Yossi Matias

Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle

Attention-based Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration.…

Machine Learning · Computer Science 2026-01-13 Ruifeng Ren , Sheng Ouyang , Huayi Tang , Yong Liu

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang , Yaming Yang , Jiangang Bai , Mingliang Zhang , Jing Bai , Jing Yu , Ce Zhang , Gao Huang , Yunhai Tong

Preconditioned Attention: Enhancing Efficiency in Transformers

Central to the success of Transformers is the attention block, which effectively models global dependencies among input tokens associated to a dataset. However, we theoretically demonstrate that standard attention mechanisms in transformers…

Machine Learning · Computer Science 2026-03-31 Hemanth Saratchandran

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined…

Computation and Language · Computer Science 2025-06-16 Hanzhi Zhang , Heng Fan , Kewei Sha , Yan Huang , Yunhe Feng

Transformer++

Recent advancements in attention mechanisms have replaced recurrent neural networks and its variants for machine translation tasks. Transformer using attention mechanism solely achieved state-of-the-art results in sequence modeling. Neural…

Computation and Language · Computer Science 2020-04-02 Prakhar Thapak , Prodip Hore

Multi-layer Learnable Attention Mask for Multimodal Tasks

While the Self-Attention mechanism in the Transformer model has proven to be effective in many domains, we observe that it is less effective in more diverse settings (e.g. multimodality) due to the varying granularity of each token and the…

Computer Vision and Pattern Recognition · Computer Science 2024-06-06 Wayner Barrios , SouYoung Jin

Can Active Memory Replace Attention?

Several mechanisms to focus attention of a neural network on selected parts of its input or memory have been used successfully in deep learning models in recent years. Attention has improved image classification, image captioning, speech…

Machine Learning · Computer Science 2017-03-08 Łukasz Kaiser , Samy Bengio

Temporal Attention Model for Neural Machine Translation

Attention-based Neural Machine Translation (NMT) models suffer from attention deficiency issues as has been observed in recent research. We propose a novel mechanism to address some of these limitations and improve the NMT attention.…

Computation and Language · Computer Science 2016-08-10 Baskaran Sankaran , Haitao Mi , Yaser Al-Onaizan , Abe Ittycheriah

Comparison of Neuronal Attention Models

Recent models for image processing are using the Convolutional neural network (CNN) which requires a pixel per pixel analysis of the input image. This method works well. However, it is time-consuming if we have large images. To increase the…

Machine Learning · Computer Science 2019-12-10 Mohamed Karim Belaid

Titans: Learning to Memorize at Test Time

Over more than a decade there has been an extensive research effort on how to effectively utilize recurrent models and attention. While recurrent models aim to compress the data into a fixed-size memory (called hidden state), attention…

Machine Learning · Computer Science 2025-01-03 Ali Behrouz , Peilin Zhong , Vahab Mirrokni