Related papers: Simplifying Transformer Blocks

Learning in Compact Spaces with Approximately Normalized Transformer

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization…

Machine Learning · Computer Science 2025-11-20 Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

Brainformers: Trading Simplicity for Efficiency

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network.…

Machine Learning · Computer Science 2024-04-26 Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean

MLP Can Be A Good Transformer Learner

Self-attention mechanism is the key of the Transformer but often criticized for its computation demands. Previous token pruning works motivate their methods from the view of computation redundancy but still need to load the full network and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang

Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel…

Machine Learning · Computer Science 2023-02-22 Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh

Transformer Utilization in Medical Image Segmentation Networks

Owing to success in the data-rich domain of natural images, Transformers have recently become popular in medical image segmentation. However, the pairing of Transformers with convolutional blocks in varying architectural permutations leaves…

Computer Vision and Pattern Recognition · Computer Science 2023-04-11 Saikat Roy, Gregor Koehler, Michael Baumgartner, Constantin Ulrich, Jens Petersen, Fabian Isensee, Klaus Maier-Hein

DeLighT: Deep and Light-weight Transformer

We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within…

Machine Learning · Computer Science 2021-02-15 Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi

Sparse is Enough in Scaling Transformers

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

Machine Learning · Computer Science 2021-11-29 Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

Formal Algorithms for Transformers

This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (*not* results). It covers what transformers are, how they are trained, what they are used for, their key architectural…

Machine Learning · Computer Science 2022-07-26 Mary Phuong, Marcus Hutter

ReplaceMe: Network Simplification via Depth Pruning and Transformer Block Linearization

We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional…

Computation and Language · Computer Science 2026-02-20 Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

Hyperloop Transformers

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory…

Machine Learning · Computer Science 2026-04-28 Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

A distributional simplicity bias in the learning dynamics of transformers

The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a "simplicity bias": neural networks prevent overfitting by initially learning simple classifiers before progressing…

Computation and Language · Computer Science 2025-10-02 Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt

Introduction to Transformers: an NLP Perspective

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes…

Computation and Language · Computer Science 2023-11-30 Tong Xiao, Jingbo Zhu

Plain Transformers Can be Powerful Graph Learners

Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most…

Machine Learning · Computer Science 2026-01-30 Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

Wide Attention Is The Way Forward For Transformers?

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT,…

Computation and Language · Computer Science 2022-03-25 Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models

Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile…

Machine Learning · Computer Science 2026-03-25 Chenyang Zhang, Qingyue Zhao, Quanquan Gu, Yuan Cao

Weight subcloning: direct initialization of transformers using larger pretrained ones

Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained…

Machine Learning · Computer Science 2023-12-18 Mohammad Samragh, Mehrdad Farajtabar, Sachin Mehta, Raviteja Vemulapalli, Fartash Faghri, Devang Naik, Oncel Tuzel, Mohammad Rastegari

Pay Attention to MLPs

Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and…

Machine Learning · Computer Science 2021-06-03 Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le

Parameter Reduction Improves Vision Transformers: A Comparative Study of Sharing and Width Reduction

Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Anantha Padmanaban Krishna Kumar

An Empirical Study on the Transferability of Transformer Modules in Parameter-Efficient Fine-Tuning

Parameter-efficient fine-tuning approaches have recently garnered a lot of attention. Having considerably lower number of trainable weights, these methods can bring about scalability and computational effectiveness. In this paper, we look…

Computation and Language · Computer Science 2023-02-23 Mohammad Akbar-Tajari, Sara Rajaee, Mohammad Taher Pilehvar