Related papers: Simplifying Transformer Blocks

200 papers

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization…

Machine Learning · Computer Science 2025-11-20 Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network.…

Self-attention mechanism is the key of the Transformer but often criticized for its computation demands. Previous token pruning works motivate their methods from the view of computation redundancy but still need to load the full network and…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Sihao Lin, Pumeng Lyu, Dongrui Liu, Tao Tang, Xiaodan Liang, Andy Song, Xiaojun Chang

Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel…

Machine Learning · Computer Science 2023-02-22 Bobby He, James Martens, Guodong Zhang, Aleksandar Botev, Andrew Brock, Samuel L Smith, Yee Whye Teh

Owing to success in the data-rich domain of natural images, Transformers have recently become popular in medical image segmentation. However, the pairing of Transformers with convolutional blocks in varying architectural permutations leaves…

Computer Vision and Pattern Recognition · Computer Science 2023-04-11 Saikat Roy, Gregor Koehler, Michael Baumgartner, Constantin Ulrich, Jens Petersen, Fabian Isensee, Klaus Maier-Hein

We introduce a deep and light-weight transformer, DeLighT, that delivers similar or better performance than standard transformer-based models with significantly fewer parameters. DeLighT more efficiently allocates parameters both (1) within…

Machine Learning · Computer Science 2021-02-15 Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi

Large Transformer models yield impressive results on many tasks, but are expensive to train, or even fine-tune, and so slow at decoding that their use and study becomes out of reach. We address this problem by leveraging sparsity. We study…

This document aims to be a self-contained, mathematically precise overview of transformer architectures and algorithms (*not* results). It covers what transformers are, how they are trained, what they are used for, their key architectural…

Machine Learning · Computer Science 2022-07-26 Mary Phuong, Marcus Hutter

We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional…

Computation and Language · Computer Science 2026-02-20 Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko

LLM architecture research generally aims to maximize model quality subject to fixed compute/latency budgets. However, many applications of interest such as edge and on-device deployment are further constrained by the model's memory…

Machine Learning · Computer Science 2026-04-28 Abbas Zeitoun, Lucas Torroba-Hennigen, Yoon Kim

The remarkable capability of over-parameterised neural networks to generalise effectively has been explained by invoking a "simplicity bias": neural networks prevent overfitting by initially learning simple classifiers before progressing…

Computation and Language · Computer Science 2025-10-02 Riccardo Rende, Federica Gerace, Alessandro Laio, Sebastian Goldt

Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes…

Computation and Language · Computer Science 2023-11-30 Tong Xiao, Jingbo Zhu

Transformers have attained outstanding performance across various modalities, owing to their simple but powerful scaled-dot-product (SDP) attention mechanisms. Researchers have attempted to migrate Transformers to graph learning, but most…

Machine Learning · Computer Science 2026-01-30 Liheng Ma, Soumyasundar Pal, Yingxue Zhang, Philip H. S. Torr, Mark Coates

The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach that is building…

Machine Learning · Computer Science 2022-11-10 Jason Ross Brown, Yiren Zhao, Ilia Shumailov, Robert D Mullins

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT,…

Computation and Language · Computer Science 2022-03-25 Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile…

Machine Learning · Computer Science 2026-03-25 Chenyang Zhang, Qingyue Zhao, Quanquan Gu, Yuan Cao

Training large transformer models from scratch for a target task requires lots of data and is computationally demanding. The usual practice of transfer learning overcomes this challenge by initializing the model with weights of a pretrained…

Transformers have become one of the most important architectural innovations in deep learning and have enabled many breakthroughs over the past few years. Here we propose a simple network architecture, gMLP, based on MLPs with gating, and…

Machine Learning · Computer Science 2021-06-03 Hanxiao Liu, Zihang Dai, David R. So, Quoc V. Le

Although scaling laws and many empirical results suggest that increasing the size of Vision Transformers often improves performance, model accuracy and training behavior are not always monotonically increasing with scale. Focusing on…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Anantha Padmanaban Krishna Kumar

Parameter-efficient fine-tuning approaches have recently garnered a lot of attention. Having considerably lower number of trainable weights, these methods can bring about scalability and computational effectiveness. In this paper, we look…

Computation and Language · Computer Science 2023-02-23 Mohammad Akbar-Tajari, Sara Rajaee, Mohammad Taher Pilehvar