Related papers: Quantifying Context Mixing in Transformers

200 papers

The Transformer architecture aggregates input information through the self-attention mechanism, but there is no clear understanding of how this information is mixed across the entire model. Additionally, recent works have demonstrated that…

Computation and Language · Computer Science 2022-10-25 Javier Ferrando, Gerard I. Gállego, Marta R. Costa-jussà

Sequence modelling requires determining which past tokens are causally relevant from the context and their importance: a process inherent to the attention layers in transformers, yet whose underlying learned mechanisms remain poorly…

Machine Learning · Computer Science 2026-04-14 Francesco D'Angelo, Nicolas Flammarion

Transformers have recently revolutionized many domains in modern machine learning and one salient discovery is their remarkable in-context learning capability, where models can solve an unseen task by utilizing task-specific prompts without…

Machine Learning · Computer Science 2023-10-10 Yu Huang, Yuan Cheng, Yingbin Liang

In this paper, we delve deep into the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and…

Computation and Language · Computer Science 2020-02-10 Gino Brunner, Yang Liu, Damián Pascual, Oliver Richter, Massimiliano Ciaramita, Roger Wattenhofer

In the Transformer model, "self-attention" combines information from attended embeddings into the representation of the focal embedding in the next layer. Thus, across layers of the Transformer, information originating from different tokens…

Machine Learning · Computer Science 2020-06-02 Samira Abnar, Willem Zuidema

Neural attention, especially the self-attention made popular by the Transformer, has become the workhorse of state-of-the-art natural language processing (NLP) models. Very recent work suggests that the self-attention in the Transformer…

Computation and Language · Computer Science 2020-10-16 Zhengxuan Wu, Thanh-Son Nguyen, Desmond C. Ong

This work proposes an extensive analysis of the Transformer architecture in the Neural Machine Translation (NMT) setting. Focusing on the encoder-decoder attention mechanism, we prove that attention weights systematically make alignment…

Computation and Language · Computer Science 2021-09-14 Javier Ferrando, Marta R. Costa-jussà

Transformers predict over a representation of a sequence. The same data can be written as bytes, characters, or subword tokens, and these representations may be lossless. Yet, under a fixed context window, they need not expose the same…

Machine Learning · Computer Science 2026-05-14 Amirmehdi Jafari Fesharaki, Mohammadamin Rami, Aslan Tchamkerten

This paper investigates fake news detection as a downstream evaluation of Transformer representations, benchmarking encoder-only and decoder-only pre-trained models (BERT, GPT-2, Transformer-XL) as frozen embedders paired with lightweight…

Computation and Language · Computer Science 2025-12-01 Sumit Mamtani, Abhijeet Bhure

Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in…

Computation and Language · Computer Science 2025-09-23 Asif Shahriar, Rifat Shahriyar, M Saifur Rahman

Interpretability is an important aspect of the trustworthiness of a model's predictions. Transformer's predictions are widely explained by the attention weights, i.e., a probability distribution generated at its self-attention unit (head).…

Computation and Language · Computer Science 2021-06-03 Rishabh Bhardwaj, Navonil Majumder, Soujanya Poria, Eduard Hovy

Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to…

Computer Vision and Pattern Recognition · Computer Science 2025-01-15 Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

The attention mechanism within the transformer architecture enables the model to weigh and combine tokens based on their relevance to the query. While self-attention has enjoyed major success, it notably treats all queries $q$ in the same…

Machine Learning · Computer Science 2024-11-21 Xuechen Zhang, Xiangyu Chang, Mingchen Li, Amit Roy-Chowdhury, Jiasi Chen, Samet Oymak

Transformers have revolutionized machine learning across diverse domains, yet understanding their behavior remains crucial, particularly in high-stakes applications. This paper introduces the contextual counting task, a novel toy problem…

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks (RNNs) in end-to-end (E2E) automatic speech recognition (ASR) systems. However, the Transformer has a drawback in…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-17 Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates…

Machine Learning · Computer Science 2025-06-09 Andrey Zhmoginov, Jihwan Lee, Max Vladymyrov, Mark Sandler

Self-attention models have shown their flexibility in parallel computation and their effectiveness in modeling both long- and short-term dependencies. However, they calculate the dependencies between representations without considering the…

Computation and Language · Computer Science 2019-02-18 Baosong Yang, Jian Li, Derek Wong, Lidia S. Chao, Xing Wang, Zhaopeng Tu

Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target…

Machine Learning · Computer Science 2025-11-26 Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

Vision transformers have established a precedent of patchifying images into uniformly-sized chunks before processing. We hypothesize that this design choice may limit models in learning comprehensive and compositional representations from…

Computer Vision and Pattern Recognition · Computer Science 2025-05-20 Neha Kalibhat, Priyatham Kattakinda, Sumit Nawathe, Arman Zarei, Nikita Seleznev, Samuel Sharpe, Senthil Kumar, Soheil Feizi