Related papers: A Transformer with Stack Attention

Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain…

Computation and Language · Computer Science 2024-01-25 Brian DuSell, David Chiang

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Evaluating Transformer's Ability to Learn Mildly Context-Sensitive Languages

Despite the fact that Transformers perform well in NLP tasks, recent studies suggest that self-attention is theoretically limited in learning even some regular and context-free languages. These findings motivated us to think about their…

Computation and Language · Computer Science 2023-10-20 Shunjie Wang, Shane Steinert-Threlkeld

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, its direct application to speech tasks is not trivial. The nature of this sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

An Attention Matrix for Every Decision: Faithfulness-based Arbitration Among Multiple Attention-Based Interpretations of Transformers in Text Classification

Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations…

Computation and Language · Computer Science 2022-11-29 Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

Selective Attention Improves Transformer

Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention…

Computation and Language · Computer Science 2025-04-25 Yaniv Leviathan, Matan Kalman, Yossi Matias

Temporal Attention for Language Models

Pretrained language models based on the transformer architecture have shown great success in NLP. Textual training data often comes from the web and is thus tagged with time-specific information, but most language models ignore this…

Computation and Language · Computer Science 2022-05-05 Guy D. Rosin, Kira Radinsky

Multi-Head Self-Attention with Role-Guided Masks

The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input dispensing recurrence…

Computation and Language · Computer Science 2020-12-24 Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian Hansen, Maria Maistro, Jakob Grue Simonsen, Christina Lioma

Is Attention All What You Need? -- An Empirical Investigation on Convolution-Based Active Memory and Self-Attention

The key to a Transformer model is the self-attention mechanism, which allows the model to analyze an entire sequence in a computationally efficient manner. Recent work has suggested the possibility that general attention mechanisms used by…

Machine Learning · Computer Science 2020-01-01 Thomas Dowdell, Hongyu Zhang

Transformer++

Recent advancements in attention mechanisms have replaced recurrent neural networks and its variants for machine translation tasks. Transformer using attention mechanism solely achieved state-of-the-art results in sequence modeling. Neural…

Computation and Language · Computer Science 2020-04-02 Prakhar Thapak, Prodip Hore

StackTrans: From Large Language Model to Large Pushdown Automata Model

The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the…

Software Engineering · Computer Science 2025-08-05 Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin

Multi-View Self-Attention Based Transformer for Speaker Recognition

Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-28 Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang

Fastformer: Additive Attention Can Be All You Need

Transformer is a powerful model for text understanding. However, it is inefficient due to its quadratic complexity to input sequence length. Although there are many methods on Transformer acceleration, they are still either inefficient on…

Computation and Language · Computer Science 2021-09-07 Chuhan Wu, Fangzhao Wu, Tao Qi, Yongfeng Huang, Xing Xie

Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Self-attention based Transformer has demonstrated the state-of-the-art performances in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of…

Computation and Language · Computer Science 2019-12-30 Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

Linear Log-Normal Attention with Unbiased Concentration

Transformer models have achieved remarkable results in a wide range of applications. However, their scalability is hampered by the quadratic time and memory complexity of the self-attention mechanism concerning the sequence length. This…

Machine Learning · Computer Science 2024-02-27 Yury Nahshan, Joseph Kampeas, Emir Haleva

Transition-based Parsing with Stack-Transformers

Modeling the parser state is key to good performance in transition-based parsing. Recurrent Neural Networks considerably improved the performance of transition-based systems by modelling the global state, e.g. stack-LSTM parsers, or local…

Computation and Language · Computer Science 2020-10-22 Ramon Fernandez Astudillo, Miguel Ballesteros, Tahira Naseem, Austin Blodgett, Radu Florian

Attention mechanisms in neural networks

Attention mechanisms represent a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions. This monograph provides a…

Machine Learning · Computer Science 2026-01-08 Hasi Hays

Characterizing the Expressivity of Local Attention in Transformers

The transformer is the most popular neural architecture for language modeling. The cornerstone of the transformer is its global attention mechanism, which lets the model aggregate information from all preceding tokens before generating the…

Computation and Language · Computer Science 2026-05-20 Jiaoda Li, Ryan Cotterell

Memorization in Attention-only Transformers

Recent research has explored the memorization capacity of multi-head attention, but these findings are constrained by unrealistic limitations on the context size. We present a novel proof for language-based Transformers that extends the…

Artificial Intelligence · Computer Science 2025-03-11 Léo Dana, Muni Sreenivas Pydi, Yann Chevaleyre