Related papers: Attention Approximates Sparse Distributed Memory

A New Training Algorithm for Kanerva's Sparse Distributed Memory

The Sparse Distributed Memory (SDM) proposed by Pentti Kanerva was conceived as a model of human long-term memory. The architecture of the SDM allows binary patterns to be stored and then retrieved using partially matching…

Computer Vision and Pattern Recognition · Computer Science 2012-07-30 Lou Marvin Caraig

NoiseFormer -- Noise Diffused Symmetric Attention Transformer

The Transformer architecture has had a long and successful run in Deep Learning (DL) and Large Language Models (LLMs) because of its powerful attention-based learning and parallel architecture. As the models grow gigantic…

Machine Learning · Computer Science 2026-01-21 Phani Kumar, Nyshadham, Jyothendra Varma, Polisetty V R K, Aditya Rathore

SparseBERT: Rethinking the Importance Analysis in Self-attention

Transformer-based models are widely used in natural language processing (NLP). Their core component, self-attention, has aroused widespread interest. To understand the self-attention mechanism, a direct method is to visualize the attention…

Machine Learning · Computer Science 2021-07-02 Han Shi, Jiahui Gao, Xiaozhe Ren, Hang Xu, Xiaodan Liang, Zhenguo Li, James T. Kwok

A Length Adaptive Algorithm-Hardware Co-design of Transformer on FPGA Through Sparse Attention and Dynamic Pipelining

Transformers have been considered among the most important deep learning models since 2018, in part because they establish state-of-the-art (SOTA) records and could potentially replace existing Deep Neural Networks (DNNs). Despite the remarkable…

Machine Learning · Computer Science 2022-08-23 Hongwu Peng, Shaoyi Huang, Shiyang Chen, Bingbing Li, Tong Geng, Ang Li, Weiwen Jiang, Wujie Wen, Jinbo Bi, Hang Liu, Caiwen Ding

Adaptive Sparse and Monotonic Attention for Transformer-based Automatic Speech Recognition

The Transformer architecture model, based on self-attention and multi-head attention, has achieved remarkable success in offline end-to-end Automatic Speech Recognition (ASR). However, self-attention and multi-head attention cannot be…

Computation and Language · Computer Science 2022-10-03 Chendong Zhao, Jianzong Wang, Wen qi Wei, Xiaoyang Qu, Haoqian Wang, Jing Xiao

Transformer Acceleration with Dynamic Sparse Attention

Transformers are the mainstream of NLP applications and are becoming increasingly popular in other domains such as Computer Vision. Despite the improvements in model quality, the enormous computation costs make Transformers difficult at…

Machine Learning · Computer Science 2021-10-22 Liu Liu, Zheng Qu, Zhaodong Chen, Yufei Ding, Yuan Xie

The Kanerva Machine: A Generative Distributed Memory

We present an end-to-end trained memory system that quickly adapts to new data and generates samples like them. Inspired by Kanerva's sparse distributed memory, it has a robust distributed reading and writing mechanism. The memory is…

Machine Learning · Statistics 2018-06-19 Yan Wu, Greg Wayne, Alex Graves, Timothy Lillicrap

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by…

Machine Learning · Computer Science 2020-01-01 Thomas Dowdell, Hongyu Zhang

Sparse Distributed Memory is a Continual Learner

Continual learning is a problem for artificial neural networks that their biological counterparts are adept at solving. Building on work using Sparse Distributed Memory (SDM) to connect a core neural circuit with the powerful Transformer…

Neural and Evolutionary Computing · Computer Science 2023-03-28 Trenton Bricken, Xander Davies, Deepak Singh, Dmitry Krotov, Gabriel Kreiman

Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. For this approach, there has been significant research into…

Computation and Language · Computer Science 2026-05-26 Spandan Pratyush

Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements…

Computation and Language · Computer Science 2024-06-25 Chao Lou, Zixia Jia, Zilong Zheng, Kewei Tu

The Information Pathways Hypothesis: Transformers are Dynamic Self-Ensembles

Transformers use the dense self-attention mechanism which gives a lot of flexibility for long-range connectivity. Over multiple layers of a deep transformer, the number of possible connectivity patterns increases exponentially. However,…

Machine Learning · Computer Science 2023-06-05 Md Shamim Hussain, Mohammed J. Zaki, Dharmashankar Subramanian

Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost

To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under hand-crafted patterns and 2) full attention followed by a…

Machine Learning · Computer Science 2022-10-28 Sungjun Cho, Seonwoo Min, Jinwoo Kim, Moontae Lee, Honglak Lee, Seunghoon Hong

Understanding Transformer from the Perspective of Associative Memory

In this paper, we share our reflections and insights on understanding Transformer architectures through the lens of associative memory--a classic psychological concept inspired by human cognition. We start with the basics of associative…

Machine Learning · Computer Science 2025-05-27 Shu Zhong, Mingyu Xu, Tenglong Ao, Guang Shi

Efficient Content-Based Sparse Attention with Routing Transformers

Self-attention has recently been adopted for a wide range of sequence modeling problems. Despite its effectiveness, self-attention suffers from quadratic compute and memory requirements with respect to sequence length. Successful approaches…

Machine Learning · Computer Science 2020-10-27 Aurko Roy, Mohammad Saffar, Ashish Vaswani, David Grangier

Towards understanding how attention mechanism works in deep learning

Attention mechanism has been extensively integrated within mainstream neural network architectures, such as Transformers and graph attention networks. Yet, its underlying working principles remain somewhat elusive. What is its essence? Are…

Machine Learning · Computer Science 2024-12-25 Tianyu Ruan, Shihua Zhang

Analyzing the Structure of Attention in a Transformer Language Model

The Transformer is a fully attention-based alternative to recurrent networks that has achieved state-of-the-art results across a range of NLP tasks. In this paper, we analyze the structure of attention in a Transformer language model, the…

Computation and Language · Computer Science 2019-06-20 Jesse Vig, Yonatan Belinkov

Smart Bird: Learnable Sparse Attention for Efficient and Effective Transformer

The Transformer has achieved great success in NLP. However, the quadratic complexity of its self-attention mechanism makes it inefficient at handling long sequences. Many existing works seek to accelerate Transformers by…

Computation and Language · Computer Science 2021-09-03 Chuhan Wu, Fangzhao Wu, Tao Qi, Binxing Jiao, Daxin Jiang, Yongfeng Huang, Xing Xie

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

Trading with the Momentum Transformer: An Intelligent and Interpretable Architecture

We introduce the Momentum Transformer, an attention-based deep-learning architecture, which outperforms benchmark time-series momentum and mean-reversion trading strategies. Unlike state-of-the-art Long Short-Term Memory (LSTM)…

Machine Learning · Computer Science 2022-11-24 Kieran Wood, Sven Giegerich, Stephen Roberts, Stefan Zohren