Related papers: Quantifying Context Mixing in Transformers

Measuring the Mixing of Contextual Information in the Transformer

The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that…

Computation and Language · Computer Science 2022-10-25 Javier Ferrando, Gerard I. Gállego, Marta R. Costa-jussà

Transformers Learn Latent Mixture Models In-Context via Mirror Descent

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly…

Machine Learning · Computer Science 2026-04-14 Francesco D'Angelo, Nicolas Flammarion

In-Context Convergence of Transformers

Transformers have recently revolutionized many domains in modern machine learning and one salient discovery is their remarkable in-context learning capability, where models can solve an unseen task by utilizing task-specific prompts without…

Machine Learning · Computer Science 2023-10-10 Yu Huang, Yuan Cheng, Yingbin Liang

On Identifiability in Transformers

In this paper we delve deep in the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and…

Computation and Language · Computer Science 2020-02-10 Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer

Quantifying Attention Flow in Transformers

In the Transformer model, "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer. Thus, across layers of the Transformer, information originating from different tokens…

Machine Learning · Computer Science 2020-06-02 Samira Abnar, Willem Zuidema

Structured Self-Attention Weights Encode Semantics in Sentiment Analysis

Neural attention, especially the self-attention made popular by the Transformer, has become the workhorse of state-of-the-art natural language processing (NLP) models. Very recent work suggests that the self-attention in the Transformer…

Computation and Language · Computer Science 2020-10-16 Zhengxuan Wu, Thanh-Son Nguyen, Desmond C. Ong

Attention Weights in Transformer NMT Fail Aligning Words Between Sequences but Largely Explain Model Predictions

This work proposes an extensive analysis of the Transformer architecture in the Neural Machine Translation (NMT) setting. Focusing on the encoder-decoder attention mechanism, we prove that attention weights systematically make alignment…

Computation and Language · Computer Science 2021-09-14 Javier Ferrando, Marta R. Costa-jussà

Effective Context in Transformers: An Analysis of Fragmentation and Tokenization

Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same…

Machine Learning · Computer Science 2026-05-14 Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

Pooling Attention: Evaluating Pretrained Transformer Embeddings for Deception Classification

This paper investigates fake news detection as a downstream evaluation of Transformer representations, benchmarking encoder-only and decoder-only pre-trained models (BERT, GPT-2, Transformer-XL) as frozen embedders paired with lightweight…

Computation and Language · Computer Science 2025-12-01 Sumit Mamtani, Abhijeet Bhure

Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in…

Computation and Language · Computer Science 2025-09-23 Asif Shahriar, Rifat Shahriyar, M Saifur Rahman

More Identifiable yet Equally Performant Transformers for Text Classification

Interpretability is an important aspect of the trustworthiness of a model's predictions. Transformer's predictions are widely explained by the attention weights, i.e., a probability distribution generated at its self-attention unit (head).…

Computation and Language · Computer Science 2021-06-03 Rishabh Bhardwaj, Navonil Majumder, Soujanya Poria, Eduard Hovy

Dissecting Query-Key Interaction in Vision Transformers

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to…

Computer Vision and Pattern Recognition · Computer Science 2025-01-15 Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

Selective Attention: Enhancing Transformer through Principled Context Control

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same…

Machine Learning · Computer Science 2024-11-21 Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

Contextual Counting: A Mechanistic Study of Transformers on a Quantitative Task

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem…

Machine Learning · Computer Science 2024-06-06 Siavash Golkar, Alberto Bietti, Mariel Pettee, Michael Eickenberg, Miles Cranmer, Keiya Hirashima, Geraud Krawezik, Nicholas Lourie, Michael McCabe, Rudy Morel, Ruben Ohana, Liam Holden Parker, Bruno Régaldo-Saint Blancard, Kyunghyun Cho, Shirley Ho

Transformer ASR with Contextual Block Processing

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-17 Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

Contextually Guided Transformers via Low-Rank Adaptation

Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates…

Machine Learning · Computer Science 2025-06-09 Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler

Context-Aware Self-Attention Networks

Self-attention models have shown flexibility in parallel computation and effectiveness in modeling both long- and short-term dependencies. However, they calculate the dependencies between representations without considering the…

Computation and Language · Computer Science 2019-02-18 Baosong Yang, Jian Li, Derek Wong, Lidia S. Chao, Xing Wang, Zhaopeng Tu

In-Context Compositional Learning via Sparse Coding Transformer

Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target…

Machine Learning · Computer Science 2025-11-26 Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi