Related papers: Pushdown Layers: Encoding Recursive Structure in T…

Adaptive Large Language Models By Layerwise Attention Shortcuts

Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we…

Computation and Language · Computer Science 2024-12-24 Prateek Verma, Mert Pilanci

Contextual Memory Reweaving in Large Language Models Using Layered Latent State Reconstruction

Memory retention challenges in deep neural architectures have ongoing limitations in the ability to process and recall extended contextual information. Token dependencies degrade as sequence length increases, leading to a decline in…

Computation and Language · Computer Science 2025-03-26 Frederick Dillon, Gregor Halvorsen, Simon Tattershall, Magnus Rowntree, Gareth Vanderpool

Incorporating Residual and Normalization Layers into Analysis of Masked Language Models

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

Analyzing the Structure of Attention in a Transformer Language Model

The Transformer is a fully attention-based alternative to recurrent networks that has achieved state-of-the-art results across a range of NLP tasks. In this paper, we analyze the structure of attention in a Transformer language model, the…

Computation and Language · Computer Science 2019-06-20 Jesse Vig, Yonatan Belinkov

TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling

In this paper, we describe the use of recurrent neural networks to capture sequential information from the self-attention representations to improve the Transformers. Although self-attention mechanism provides a means to exploit long…

Computation and Language · Computer Science 2021-04-06 Tze Yuang Chong, Xuyang Wang, Lin Yang, Junjie Wang

Pre-Training a Graph Recurrent Network for Language Representation

Transformer-based pre-trained models have gained much advance in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformer may not be…

Computation and Language · Computer Science 2022-10-27 Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang

A Systematic Analysis of Hybrid Linear Attention

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading…

Computation and Language · Computer Science 2025-07-10 Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian

StackTrans: From Large Language Model to Large Pushdown Automata Model

The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the…

Software Engineering · Computer Science 2025-08-05 Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To…

Computation and Language · Computer Science 2025-10-27 Mutian He, Philip N. Garner

Reversed Attention: On The Gradient Descent Of Attention Layers In GPT

The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward…

Computation and Language · Computer Science 2024-12-24 Shahar Katz, Lior Wolf

When Can Self-Attention Be Replaced by Feed Forward Layers?

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-29 Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain…

Computation and Language · Computer Science 2024-01-25 Brian DuSell, David Chiang

Augmenting Self-attention with Persistent Memory

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long…

Machine Learning · Computer Science 2019-07-03 Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

Transformer Grammars: Augmenting Transformer Language Models with Syntactic Inductive Biases at Scale

We introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are…

Computation and Language · Computer Science 2022-12-07 Laurent Sartran, Samuel Barrett, Adhiguna Kuncoro, Miloš Stanojević, Phil Blunsom, Chris Dyer

Language Modeling with Learned Meta-Tokens

While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using…

Computation and Language · Computer Science 2025-09-23 Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari

Talking Heads: Understanding Inter-layer Communication in Transformer Language Models

Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. We analyze a mechanism used in two LMs to…

Computation and Language · Computer Science 2025-05-12 Jack Merullo, Carsten Eickhoff, Ellie Pavlick

ReSSFormer: A Recursive Sparse Structured Transformer for Scalable and Long-Context Reasoning

While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer…

Computation and Language · Computer Science 2025-10-03 Haochen You, Baojing Liu

Language Modeling with Deep Transformers

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well configured…

Computation and Language · Computer Science 2019-09-25 Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney

Structural Latency Perturbation in Large Language Models Through Recursive State Induction

Computational efficiency has remained a critical consideration in scaling high-capacity language models, with inference latency and resource consumption presenting significant constraints on real-time applications. The study has introduced…

Computation and Language · Computer Science 2025-03-26 Michael Mangrum, Jonathan Pemberton, Benedict Wetherby, Philip Montague