Related papers: Emergent Stack Representations in Modeling Counter…

A Meta-Learning Perspective on Transformers for Causal Language Modeling

The Transformer architecture has become prominent in developing large causal language models. However, mechanisms to explain its capabilities are not well understood. Focused on the training process, here we establish a meta-learning view…

Machine Learning · Computer Science · 2024-03-26 · Xinbo Wu, Lav R. Varshney

On the Ability and Limitations of Transformers to Recognize Formal Languages

Transformers have supplanted recurrent models in a large number of NLP tasks. However, the differences in their abilities to model different syntactic properties remain largely unknown. Past works suggest that LSTMs generalize very well on…

Computation and Language · Computer Science · 2020-10-09 · Satwik Bhattamishra, Kabir Ahuja, Navin Goyal

Unveiling Multilinguality in Transformer Models: Exploring Language Specificity in Feed-Forward Networks

Recent research suggests that the feed-forward module within Transformers can be viewed as a collection of key-value memories, where the keys learn to capture specific patterns from the input based on the training examples. The values then…

Computation and Language · Computer Science · 2023-10-25 · Sunit Bhattacharya, Ondrej Bojar

Multi-scale Transformer Language Models

We investigate multi-scale transformer language models that learn representations of text at multiple scales, and present three different architectures that have an inductive bias to handle the hierarchical nature of language. Experiments…

Computation and Language · Computer Science · 2020-05-05 · Sandeep Subramanian, Ronan Collobert, Marc'Aurelio Ranzato, Y-Lan Boureau

A Transformer with Stack Attention

Natural languages are believed to be (mildly) context-sensitive. Despite underpinning remarkably capable large language models, transformers are unable to model many context-free language tasks. In an attempt to address this limitation in…

Computation and Language · Computer Science · 2024-05-15 · Jiaoda Li, Jennifer C. White, Mrinmaya Sachan, Ryan Cotterell

Learning to Recall with Transformers Beyond Orthogonal Embeddings

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training…

Machine Learning · Statistics · 2026-03-18 · Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

Linguistic Interpretability of Transformer-based Language Models: a systematic review

Language models based on the Transformer architecture achieve excellent results in many language-related tasks, such as text classification or sentiment analysis. However, despite the architecture of these models being well-defined, little…

Computation and Language · Computer Science · 2025-04-14 · Miguel López-Otal, Jorge Gracia, Jordi Bernad, Carlos Bobed, Lucía Pitarch-Ballesteros, Emma Anglés-Herrero

Transformers to Learn Hierarchical Contexts in Multiparty Dialogue for Span-based Question Answering

We introduce a novel approach to transformers that learns hierarchical representations in multiparty dialogue. First, three language modeling tasks are used to pre-train the transformers, token- and utterance-level language modeling and…

Computation and Language · Computer Science · 2020-06-01 · Changmao Li, Jinho D. Choi

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We focus on the Transformers for our…

Computation and Language · Computer Science · 2019-09-05 · Elena Voita, Rico Sennrich, Ivan Titov

A Primer on the Inner Workings of Transformer-based Language Models

The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical…

Computation and Language · Computer Science · 2024-10-15 · Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà

How do Transformers perform In-Context Autoregressive Learning?

Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a…

Machine Learning · Statistics · 2024-06-06 · Michael E. Sander, Raja Giryes, Taiji Suzuki, Mathieu Blondel, Gabriel Peyré

Learning to Transduce with Unbounded Memory

Recently, strong results have been demonstrated by Deep Recurrent Neural Networks on natural language transduction problems. In this paper we explore the representational power of these models using synthetic grammars designed to exhibit…

Neural and Evolutionary Computing · Computer Science · 2015-11-04 · Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, Phil Blunsom

The Hidden Space of Transformer Language Adapters

We analyze the operation of transformer language adapters, which are small modules trained on top of a frozen language model to adapt its predictions to new target languages. We show that adapted predictions mostly evolve in the source…

Computation and Language · Computer Science · 2024-06-11 · Jesujoba O. Alabi, Marius Mosbach, Matan Eyal, Dietrich Klakow, Mor Geva

Memorizing Transformers

Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus…

Machine Learning · Computer Science · 2022-03-18 · Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, Christian Szegedy

Transformers converge to invariant algorithmic cores

Large language models exhibit sophisticated capabilities, yet understanding how they work internally remains a central challenge. A fundamental obstacle is that training selects for behavior, not circuitry, so many weight configurations can…

Machine Learning · Computer Science · 2026-02-27 · Joshua S. Schiffman

Deep Transformers with Latent Depth

The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks. However, how to leverage model capacity with large or variable depths is still an open challenge. We present a probabilistic framework to…

Computation and Language · Computer Science · 2020-10-19 · Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

Exploring Internal Numeracy in Language Models: A Case Study on ALBERT

It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to…

Computation and Language · Computer Science · 2024-04-26 · Ulme Wennberg, Gustav Eje Henter

Backward Lens: Projecting Language Model Gradients into the Vocabulary Space

Understanding how Transformer-based Language Models (LMs) learn and recall information is a key goal of the deep learning community. Recent interpretability methods project weights and hidden states obtained from the forward pass to the…

Computation and Language · Computer Science · 2024-02-21 · Shahar Katz, Yonatan Belinkov, Mor Geva, Lior Wolf

Pre-Training a Graph Recurrent Network for Language Representation

Transformer-based pre-trained models have gained much advance in recent years, becoming one of the most important backbones in natural language processing. Recent work shows that the attention mechanism inside Transformer may not be…

Computation and Language · Computer Science · 2022-10-27 · Yile Wang, Linyi Yang, Zhiyang Teng, Ming Zhou, Yue Zhang

Solve the Loop: Attractor Models for Language and Reasoning

Looped Transformers offer a promising alternative to purely feed-forward computation by iteratively refining latent representations, improving language modeling and reasoning. Yet recurrent architectures remain unstable to train, costly to…

Machine Learning · Computer Science · 2026-05-13 · Jacob Fein-Ashley, Paria Rashidinejad