Related papers: Pushdown Layers: Encoding Recursive Structure in T…

200 papers

Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we…

Computation and Language · Computer Science 2024-12-24 Prateek Verma, Mert Pilanci

Memory retention challenges in deep neural architectures continue to limit the ability to process and recall extended contextual information. Token dependencies degrade as sequence length increases, leading to a decline in…

Computation and Language · Computer Science 2025-03-26 Frederick Dillon, Gregor Halvorsen, Simon Tattershall, Magnus Rowntree, Gareth Vanderpool

Transformer architecture has become ubiquitous in the natural language processing field. To interpret the Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not only…

Computation and Language · Computer Science 2021-09-16 Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui

The Transformer is a fully attention-based alternative to recurrent networks that has achieved state-of-the-art results across a range of NLP tasks. In this paper, we analyze the structure of attention in a Transformer language model, the…

Computation and Language · Computer Science 2019-06-20 Jesse Vig, Yonatan Belinkov

In this paper, we describe the use of recurrent neural networks to capture sequential information from the self-attention representations to improve the Transformers. Although self-attention mechanism provides a means to exploit long…

Computation and Language · Computer Science 2021-04-06 Tze Yuang Chong, Xuyang Wang, Lin Yang, Junjie Wang

Transformer-based pre-trained models have advanced considerably in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformer may not be…

Computation and Language · Computer Science 2022-10-27 Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang

Transformers face quadratic complexity and memory issues with long sequences, prompting the adoption of linear attention mechanisms using fixed-size hidden states. However, linear models often suffer from limited recall performance, leading…

Computation and Language · Computer Science 2025-07-10 Dustin Wang, Rui-Jie Zhu, Steven Abreu, Yong Shan, Taylor Kergan, Yuqi Pan, Yuhong Chou, Zheng Li, Ge Zhang, Wenhao Huang, Jason Eshraghian

The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the…

Software Engineering · Computer Science 2025-08-05 Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin

The Transformer is a ubiquitous model for natural language processing and has attracted wide attention in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Linear-attention models that compress the entire input sequence into a fixed-size recurrent state offer an efficient alternative to Transformers, but their finite memory induces forgetfulness that harms retrieval-intensive tasks. To…

Computation and Language · Computer Science 2025-10-27 Mutian He, Philip N. Garner

The success of Transformer-based Language Models (LMs) stems from their attention mechanism. While this mechanism has been extensively studied in explainability research, particularly through the attention values obtained during the forward…

Computation and Language · Computer Science 2024-12-24 Shahar Katz, Lior Wolf

Recently, self-attention models such as Transformers have given competitive results compared to recurrent neural network systems in speech recognition. The key factor for the outstanding performance of self-attention models is their ability…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-29 Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals

Attention, specifically scaled dot-product attention, has proven effective for natural language, but it does not have a mechanism for handling hierarchical patterns of arbitrary nesting depth, which limits its ability to recognize certain…

Computation and Language · Computer Science 2024-01-25 Brian DuSell, David Chiang

Transformer networks have led to important progress in language modeling and machine translation. These models include two consecutive modules, a feed-forward layer and a self-attention layer. The latter allows the network to capture long…

Machine Learning · Computer Science 2019-07-03 Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Herve Jegou, Armand Joulin

We introduce Transformer Grammars (TGs), a novel class of Transformer language models that combine (i) the expressive power, scalability, and strong performance of Transformers and (ii) recursive syntactic compositions, which here are…

Computation and Language · Computer Science 2022-12-07 Laurent Sartran, Samuel Barrett, Adhiguna Kuncoro, Miloš Stanojević, Phil Blunsom, Chris Dyer

While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using…

Computation and Language · Computer Science 2025-09-23 Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari

Although it is known that transformer language models (LMs) pass features from early layers to later layers, it is not well understood how this information is represented and routed by the model. We analyze a mechanism used in two LMs to…

Computation and Language · Computer Science 2025-05-12 Jack Merullo, Carsten Eickhoff, Ellie Pavlick

While Transformer architectures have demonstrated impressive scalability across domains, they continue to face challenges in long-context reasoning, computational efficiency, and structural generalization - largely due to rigid layer…

Computation and Language · Computer Science 2025-10-03 Haochen You, Baojing Liu

We explore deep autoregressive Transformer models in language modeling for speech recognition. We focus on two aspects. First, we revisit Transformer model configurations specifically for language modeling. We show that well configured…

Computation and Language · Computer Science 2019-09-25 Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney

Computational efficiency has remained a critical consideration in scaling high-capacity language models, with inference latency and resource consumption presenting significant constraints on real-time applications. The study has introduced…

Computation and Language · Computer Science 2025-03-26 Michael Mangrum, Jonathan Pemberton, Benedict Wetherby, Philip Montague