Related papers: Augmenting Self-attention with Persistent Memory

When Can Self-Attention Be Replaced by Feed Forward Layers?

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-29 Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Transformer Feed-Forward Layers Are Key-Value Memories

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where…

Computation and Language · Computer Science 2021-09-07 Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science 2023-10-25 Sunit Bhattacharya, Ondrej Bojar

Attention is All You Need Until You Need Retention

This work introduces a novel Retention Layer mechanism for Transformer based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic…

Machine Learning · Computer Science 2025-01-17 M. Murat Yaslioglu

A Primal-Dual Framework for Transformers and Neural Networks

Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often…

Machine Learning · Computer Science 2024-06-21 Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by…

Machine Learning · Computer Science 2020-01-01 Thomas Dowdell, Hongyu Zhang

Modeling Recurrence for Transformer

Recently, the Transformer model, which is based solely on attention mechanisms, has advanced the state of the art on various machine translation tasks. However, recent studies reveal that the lack of recurrence hinders its further improvement…

Computation and Language · Computer Science 2019-04-08 Jie Hao, Xing Wang, Baosong Yang, Longyue Wang, Jinfeng Zhang, Zhaopeng Tu

Memory Transformer

Transformer-based models have achieved state-of-the-art results in many natural language processing tasks. The self-attention architecture allows transformer to combine information from all elements of a sequence into context-aware…

Computation and Language · Computer Science 2021-02-17 Mikhail S. Burtsev, Yuri Kuratov, Anton Peganov, Grigory V. Sapunov

Transformer-F: A Transformer network with effective methods for learning universal sentence representation

The Transformer model is widely used in natural language processing for sentence representation. However, the previous Transformer-based models focus on function words that have limited meaning in most cases and could merely extract…

Computation and Language · Computer Science 2021-07-05 Yu Shi

Context-Aware Self-Attention Networks

Self-attention models have shown flexibility in parallel computation and effectiveness in modeling both long- and short-term dependencies. However, they calculate the dependencies between representations without considering the…

Computation and Language · Computer Science 2019-02-18 Baosong Yang, Jian Li, Derek Wong, Lidia S. Chao, Xing Wang, Zhaopeng Tu

Self-attention as an attractor network: transient memories without backpropagation

Transformers are one of the most successful architectures of modern neural networks. At their core there is the so-called attention mechanism, which recently interested the physics community as it can be written as the derivative of an…

Machine Learning · Computer Science 2024-09-25 Francesco D'Amico, Matteo Negri

Improving Transformer Models by Reordering their Sublayers

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with…

Computation and Language · Computer Science 2020-04-24 Ofir Press, Noah A. Smith, Omer Levy

Agglomerative Attention

Neural networks using transformer-based architectures have recently demonstrated great power and flexibility in modeling sequences of many types. One of the core components of transformer networks is the attention layer, which allows…

Machine Learning · Computer Science 2019-07-16 Matthew Spellings

Do Transformers Need Deep Long-Range Memory

Deep attention models have advanced the modelling of sequential data across many domains. For language modelling in particular, the Transformer-XL -- a Transformer augmented with a long-range memory of past activations -- has been shown to…

Machine Learning · Computer Science 2020-07-08 Jack W. Rae, Ali Razavi

Pre-Training a Graph Recurrent Network for Language Representation

Transformer-based pre-trained models have advanced considerably in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformer may not be…

Computation and Language · Computer Science 2022-10-27 Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang

A Survey of Retentive Network

Retentive Network (RetNet) represents a significant advancement in neural network architecture, offering an efficient alternative to the Transformer. While Transformers rely on self-attention to model dependencies, they suffer from high…

Computation and Language · Computer Science 2025-06-10 Haiqi Yang, Zhiyuan Li, Yi Chang, Yuan Wu

LaMemo: Language Modeling with Look-Ahead Memory

Although Transformers with fully connected self-attention are powerful at modeling long-term dependencies, they struggle to scale to long texts with thousands of words in language modeling. One of the solutions is to equip the model…

Computation and Language · Computer Science 2022-04-27 Haozhe Ji, Rongsheng Zhang, Zhenyu Yang, Zhipeng Hu, Minlie Huang

Wide Attention Is The Way Forward For Transformers?

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

Transformer++

Recent advancements in attention mechanisms have replaced recurrent neural networks and their variants for machine translation tasks. The Transformer, using attention mechanisms alone, achieved state-of-the-art results in sequence modeling. Neural…

Computation and Language · Computer Science 2020-04-02 Prakhar Thapak, Prodip Hore