Related papers: Transformer tricks: Precomputing the first layer

HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation

Language models with the Transformers structure have shown great performance in natural language processing. However, there still poses problems when fine-tuning pre-trained language models on downstream tasks, such as over-fitting or…

Computation and Language · Computer Science 2023-05-12 Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Fei Huang, Songfang Huang

Reformer: The Efficient Transformer

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of…

Machine Learning · Computer Science 2020-02-19 Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Jump to Conclusions: Short-Cutting Transformers With Linear Transformations

Transformer-based language models create hidden representations of their inputs at every layer, but only use final-layer representations for prediction. This obscures the internal decision-making process of the model and the utility of its…

Computation and Language · Computer Science 2024-06-21 Alexander Yom Din, Taelin Karidi, Leshem Choshen, Mor Geva

Transkimmer: Transformer Learns to Layer-wise Skim

Transformer architecture has become the de-facto model for many machine learning tasks from natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational…

Computation and Language · Computer Science 2022-05-17 Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the…

Machine Learning · Computer Science 2024-04-10 Georgy Tyukin

Structural Sensitivity in Compressed Transformers: Relative Error Propagation and Layer Removal

Compressing transformer weights makes large language models cheaper to deploy. But each layer's compression introduces an error. These errors accumulate as the signal passes through later layers, and how they accumulate is not well…

Machine Learning · Computer Science 2026-05-08 Abhinaba Basu, Kumkum Basu, Koushik Deb

Bag of Tricks for Optimizing Transformer Efficiency

Improving Transformer efficiency has become increasingly attractive recently. A wide range of methods has been proposed, e.g., pruning, quantization, new architectures and etc. But these methods are either sophisticated in implementation or…

Machine Learning · Computer Science 2021-09-10 Ye Lin, Yanyang Li, Tong Xiao, Jingbo Zhu

What's Hidden in a One-layer Randomly Weighted Transformer?

We demonstrate that, hidden within one-layer randomly weighted neural networks, there exist subnetworks that can achieve impressive performance, without ever modifying the weight initializations, on machine translation tasks. To find…

Computation and Language · Computer Science 2021-09-10 Sheng Shen, Zhewei Yao, Douwe Kiela, Kurt Keutzer, Michael W. Mahoney

Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT,…

Computation and Language · Computer Science 2022-03-25 Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

Transformer Layers as Painters

Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a…

Computation and Language · Computer Science 2025-02-14 Qi Sun, Marc Pickett, Aakash Kumar Nain, Llion Jones

You Do Not Fully Utilize Transformer's Representation Capacity

In contrast to RNNs, which compress their history into a single hidden state, Transformers can attend to all past tokens directly. However, standard Transformers rely solely on the hidden state from the previous layer to represent the…

Machine Learning · Computer Science 2025-05-29 Gleb Gerasimov, Yaroslav Aksenov, Nikita Balagansky, Viacheslav Sinii, Daniil Gavrilov

Jointly Reparametrized Multi-Layer Adaptation for Efficient and Private Tuning

Efficient finetuning of pretrained language transformers is becoming increasingly prevalent for solving natural language processing tasks. While effective, it can still require a large number of tunable parameters. This can be a drawback…

Computation and Language · Computer Science 2023-05-31 Umang Gupta, Aram Galstyan, Greg Ver Steeg

SparseSwaps: Tractable LLM Pruning Mask Refinement at Scale

The resource requirements of neural networks can be significantly reduced through pruning - the removal of seemingly less important parameters. However, for LLMs, full retraining to recover pruning-induced performance degradation is often…

Machine Learning · Computer Science 2026-02-03 Max Zimmer, Christophe Roux, Moritz Wagner, Deborah Hendrych, Sebastian Pokutta

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit "layer tying" as form…

Computation and Language · Computer Science 2025-03-03 Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster

Brainformers: Trading Simplicity for Efficiency

Transformers are central to recent successes in natural language processing and computer vision. Transformers have a mostly uniform backbone where layers alternate between feed-forward and self-attention in order to build a deep network.…

Machine Learning · Computer Science 2024-04-26 Yanqi Zhou, Nan Du, Yanping Huang, Daiyi Peng, Chang Lan, Da Huang, Siamak Shakeri, David So, Andrew Dai, Yifeng Lu, Zhifeng Chen, Quoc Le, Claire Cui, James Laudon, Jeff Dean

Greedy-layer Pruning: Speeding up Transformer Models for Natural Language Processing

Fine-tuning transformer models after unsupervised pre-training reaches a very high performance on many different natural language processing tasks. Unfortunately, transformers suffer from long inference times which greatly increases costs…

Computation and Language · Computer Science 2022-03-30 David Peer, Sebastian Stabinger, Stefan Engl, Antonio Rodriguez-Sanchez

Dynamic Layer Tying for Parameter-Efficient Transformers

In the pursuit of reducing the number of trainable parameters in deep transformer networks, we employ Reinforcement Learning to dynamically select layers during training and tie them together. Every few iterations, the RL agent is asked…

Machine Learning · Computer Science 2024-01-24 Tamir David Hay, Lior Wolf

A transformer architecture alteration to incentivise externalised reasoning

We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at…

Artificial Intelligence · Computer Science 2026-03-25 Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard

Adaptive Layer-skipping in Pre-trained LLMs

Various layer-skipping methods have been proposed to accelerate token generation in large language models (LLMs). However, limited attention has been paid to a fundamental question: How do computational demands vary across the generation of…

Computation and Language · Computer Science 2025-10-10 Xuan Luo, Weizhi Wang, Xifeng Yan

Learned Token Pruning for Transformers

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer