Related papers: Transformers from an Optimization Perspective

Revisiting Transformers with Insights from Image Filtering and Boosting

The self-attention mechanism, a cornerstone of Transformer-based state-of-the-art deep learning architectures, is largely heuristic-driven and fundamentally challenging to interpret. Establishing a robust theoretical foundation to explain…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Laziz U. Abdullaev, Maksim Tkachenko, Tan M. Nguyen

Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

Transformer-based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific tasks by leveraging input-output examples. Despite their empirical success, a comprehensive…

Machine Learning · Computer Science 2025-06-03 Yifan Hao, Chenlu Ye, Chi Han, Tong Zhang

Transformers as Intrinsic Optimizers: Forward Inference through the Energy Principle

Attention-based Transformers have demonstrated strong adaptability across a wide range of tasks and have become the backbone of modern Large Language Models (LLMs). However, their underlying mechanisms remain open for further exploration.…

Machine Learning · Computer Science 2026-01-13 Ruifeng Ren, Sheng Ouyang, Huayi Tang, Yong Liu

Transformers predicting the future. Applying attention in next-frame and time series forecasting

Recurrent Neural Networks were, until recently, one of the best ways to capture the timely dependencies in sequences. However, with the introduction of the Transformer, it has been proven that an architecture with only attention-mechanisms…

Machine Learning · Computer Science 2021-08-19 Radostin Cholakov, Todor Kolev

Transformers learn in-context by gradient descent

At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to…

Machine Learning · Computer Science 2023-06-01 Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, Max Vladymyrov

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We focus on the Transformers for our…

Computation and Language · Computer Science 2019-09-05 Elena Voita, Rico Sennrich, Ivan Titov

Two Steps Forward and One Behind: Rethinking Time Series Forecasting with Deep Learning

The Transformer is a highly successful deep learning model that has revolutionised the world of artificial neural networks, first in natural language processing and later in computer vision. This model is based on the attention mechanism…

Machine Learning · Computer Science 2023-05-09 Riccardo Ughi, Eugenio Lomurno, Matteo Matteucci

Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models

Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile…

Machine Learning · Computer Science 2026-03-25 Chenyang Zhang, Qingyue Zhao, Quanquan Gu, Yuan Cao

Introduction to Transformers: an NLP Perspective

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes…

Computation and Language · Computer Science 2023-11-30 Tong Xiao, Jingbo Zhu

Convexifying Transformers: Improving optimization and understanding of transformer networks

Understanding the fundamental mechanism behind the success of transformer networks is still an open problem in the deep learning literature. Although their remarkable performance has been mostly attributed to the self-attention mechanism,…

Machine Learning · Computer Science 2022-11-23 Tolga Ergen, Behnam Neyshabur, Harsh Mehta

A Comprehensive Survey on Applications of Transformers for Deep Learning Tasks

Transformer is a deep neural network that employs a self-attention mechanism to comprehend the contextual relationships within sequential data. Unlike conventional neural networks or updated versions of Recurrent Neural Networks (RNNs) such…

Machine Learning · Computer Science 2023-06-14 Saidul Islam, Hanae Elmekki, Ahmed Elsebai, Jamal Bentahar, Najat Drawel, Gaith Rjoub, Witold Pedrycz

Analyzing Deep Transformer Models for Time Series Forecasting via Manifold Learning

Transformer models have consistently achieved remarkable results in various domains such as natural language processing and computer vision. However, despite ongoing research efforts to better understand these models, the field still lacks…

Machine Learning · Computer Science 2024-10-18 Ilya Kaufman, Omri Azencot

A Practical Survey on Faster and Lighter Transformers

Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model…

Machine Learning · Computer Science 2023-03-28 Quentin Fournier, Gaétan Marceau Caron, Daniel Aloise

Understanding the Expressive Power and Mechanisms of Transformer for Sequence Modeling

We conduct a systematic study of the approximation properties of Transformer for sequence modeling with long, sparse and complicated memory. We investigate the mechanisms through which different components of Transformer, such as the…

Machine Learning · Computer Science 2024-10-31 Mingze Wang, Weinan E

A Mathematical Explanation of Transformers

The Transformer architecture has revolutionized the field of sequence modeling and underpins the recent breakthroughs in large language models (LLMs). However, a comprehensive mathematical theory that explains its structure and operations…

Machine Learning · Computer Science 2026-04-14 Xue-Cheng Tai, Hao Liu, Lingfeng Li, Raymond H. Chan

What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and…

Machine Learning · Computer Science 2025-03-18 Weronika Ormaniec, Felix Dangel, Sidak Pal Singh

Deriving Transformer Architectures as Implicit Multinomial Regression

While attention has been empirically shown to improve model performance, it lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically,…

Machine Learning · Computer Science 2025-10-28 Jonas A. Actor, Anthony Gruber, Eric C. Cyr

An Introduction to Transformers

The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points. The transformer has driven recent advances in natural language processing, computer vision, and…

Machine Learning · Computer Science 2026-01-21 Richard E. Turner

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

Analyzing Transformers in Embedding Space

Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent…

Computation and Language · Computer Science 2023-12-27 Guy Dar, Mor Geva, Ankit Gupta, Jonathan Berant