Related papers: Adaptive Transformers for Learning Multimodal Repr…

Adaptively Sparse Transformers

Attention mechanisms have become ubiquitous in NLP. Recent architectures, notably the Transformer, learn powerful context-aware word representations through layered, multi-headed attention. The multiple heads learn diverse types of word…

Computation and Language · Computer Science 2019-09-09 Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

Evolving Attention with Residual Convolutions

Transformer is a ubiquitous model for natural language processing and has attracted wide attentions in computer vision. The attention maps are indispensable for a transformer model to encode the dependencies among input tokens. However,…

Machine Learning · Computer Science 2021-02-26 Yujing Wang, Yaming Yang, Jiangang Bai, Mingliang Zhang, Jing Bai, Jing Yu, Ce Zhang, Gao Huang, Yunhai Tong

Multi-Head Self-Attention with Role-Guided Masks

The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input dispensing recurrence…

Computation and Language · Computer Science 2020-12-24 Dongsheng Wang, Casper Hansen, Lucas Chaves Lima, Christian Hansen, Maria Maistro, Jakob Grue Simonsen, Christina Lioma

Attention mechanisms in neural networks

Attention mechanisms represent a fundamental paradigm shift in neural network architectures, enabling models to selectively focus on relevant portions of input sequences through learned weighting functions. This monograph provides a…

Machine Learning · Computer Science 2026-01-08 Hasi Hays

Attention-Based Explainability for Structure-Property Relationships

Machine learning methods are emerging as a universal paradigm for constructing correlative structure-property relationships in materials science based on multimodal characterization. However, this necessitates development of methods for…

Materials Science · Physics 2025-08-22 Boris N. Slautin, Utkarsh Pratiush, Yongtao Liu, Hiroshi Funakubo, Vladimir V. Shvartsman, Doru C. Lupascu, Sergei V. Kalinin

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the…

Machine Learning · Computer Science 2024-10-31 Mingze Wang, Weinan E

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Self-Ablating Transformers: More Interpretability, Less Sparsity

A growing intuition in machine learning suggests a link between sparsity and interpretability. We introduce a novel self-ablation mechanism to investigate this connection ante-hoc in the context of language transformers. Our approach…

Machine Learning · Computer Science 2025-05-02 Jeremias Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero Antolin

Understanding Transformers and Attention Mechanisms: An Introduction for Applied Mathematicians

This document provides a brief introduction to the attention mechanism used in modern language models based on the Transformer architecture. We first illustrate how text is encoded as vectors and how the attention mechanism processes these…

Numerical Analysis · Mathematics 2026-04-02 Michel Fabrice Serret

Head Pursuit: Probing Attention Specialization in Multimodal Transformers

Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models…

Computer Vision and Pattern Recognition · Computer Science 2026-01-15 Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers

Recently multimodal transformer models have gained popularity because their performance on language and vision tasks suggest they learn rich visual-linguistic representations. Focusing on zero-shot image retrieval tasks, we study three…

Computation and Language · Computer Science 2021-02-02 Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac, Aida Nematzadeh

An Attention Matrix for Every Decision: Faithfulness-based Arbitration Among Multiple Attention-Based Interpretations of Transformers in Text Classification

Transformers are widely used in natural language processing, where they consistently achieve state-of-the-art performance. This is mainly due to their attention-based architecture, which allows them to model rich linguistic relations…

Computation and Language · Computer Science 2022-11-29 Nikolaos Mylonas, Ioannis Mollas, Grigorios Tsoumakas

Human Guided Exploitation of Interpretable Attention Patterns in Summarization and Topic Segmentation

The multi-head self-attention mechanism of the transformer model has been thoroughly investigated recently. In one vein of study, researchers are interested in understanding why and how transformers work. In another vein, researchers…

Computation and Language · Computer Science 2022-10-28 Raymond Li, Wen Xiao, Linzi Xing, Lanjun Wang, Gabriel Murray, Giuseppe Carenini

Sparse but not Simpler: A Multi-Level Interpretability Analysis of Vision Transformers

Sparse neural networks are often hypothesized to be more interpretable than dense models, motivated by findings that weight sparsity can produce compact circuits in language models. However, it remains unclear whether structural sparsity…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Siyu Zhang

Multi-View Self-Attention Based Transformer for Speaker Recognition

Initially developed for natural language processing (NLP), Transformer model is now widely used for speech processing tasks such as speaker recognition, due to its powerful sequence modeling capabilities. However, conventional…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-28 Rui Wang, Junyi Ao, Long Zhou, Shujie Liu, Zhihua Wei, Tom Ko, Qing Li, Yu Zhang

Transformers predicting the future. Applying attention in next-frame and time series forecasting

Recurrent Neural Networks were, until recently, one of the best ways to capture the timely dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention-mechanisms…

Machine Learning · Computer Science 2021-08-19 Radostin Cholakov, Todor Kolev

Horizontal and Vertical Attention in Transformers

Transformers are built upon multi-head scaled dot-product attention and positional encoding, which aim to learn the feature representations and token dependencies. In this work, we focus on enhancing the distinctive representation by…

Computer Vision and Pattern Recognition · Computer Science 2022-07-12 Litao Yu, Jian Zhang

Interpreting Transformers Through Attention Head Intervention

Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms' decision-making processes, or mechanistic interpretability, enables (1) accountability and control in…

Computation and Language · Computer Science 2026-03-02 Mason Kadem, Rong Zheng

Multiformer: A Head-Configurable Transformer-Based Model for Direct Speech Translation

Transformer-based models have been achieving state-of-the-art results in several fields of Natural Language Processing. However, its direct application to speech tasks is not trivial. The nature of this sequences carries problems such as…

Computation and Language · Computer Science 2022-05-17 Gerard Sant, Gerard I. Gállego, Belen Alastruey, Marta R. Costa-Jussà

Transformer Interpretability from Perspective of Attention and Gradient

Although researchers' attention is more focused on the performance of Transformer models, the interpretation of Transformer can never be ignored. Gradient is widely utilized in Transformer interpretation. From the perspective of attention…

Artificial Intelligence · Computer Science 2026-05-13 Yongjin Cui, Xiaohui Fan, Huajun Chen