Related papers: Optimizing Deeper Transformers on Small Datasets

Layer-Wise Evolution of Representations in Fine-Tuned Transformers: Insights from Sparse AutoEncoders

Fine-tuning pre-trained transformers is a powerful technique for enhancing the performance of base models on specific tasks. From early applications in models like BERT to fine-tuning Large Language Models (LLMs), this approach has been…

Computation and Language · Computer Science 2025-02-25 Suneel Nadipalli

Making the most of small Software Engineering datasets with modern machine learning

This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software…

Software Engineering · Computer Science 2021-06-30 Julian Aron Prenner, Romain Robbes

Shallow-to-Deep Training for Neural Machine Translation

Deep encoders have been proven to be effective in improving neural machine translation (NMT) systems, but training an extremely deep encoder is time consuming. Moreover, why deep models help NMT is an open question. In this paper, we…

Computation and Language · Computer Science 2020-10-09 Bei Li, Ziyang Wang, Hui Liu, Yufan Jiang, Quan Du, Tong Xiao, Huizhen Wang, Jingbo Zhu

Going deeper with Image Transformers

Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However the optimization of image transformers has been little studied so…

Computer Vision and Pattern Recognition · Computer Science 2021-04-08 Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou

Lipschitz Constrained Parameter Initialization for Deep Transformers

The Transformer translation model employs residual connection and layer normalization to ease the optimization difficulties caused by its multi-layer encoder/decoder structure. Previous research shows that even with residual connection and…

Computation and Language · Computer Science 2020-05-06 Hongfei Xu, Qiuhui Liu, Josef van Genabith, Deyi Xiong, Jingyi Zhang

Top-Tuning: a study on transfer learning for an efficient alternative to fine tuning for image classification with fast kernel methods

The impressive performance of deep learning architectures is associated with a massive increase in model complexity. Millions of parameters need to be tuned, with training and inference time scaling accordingly, together with energy…

Machine Learning · Computer Science 2023-11-10 Paolo Didier Alfano, Vito Paolo Pastore, Lorenzo Rosasco, Francesca Odone

How fine can fine-tuning be? Learning efficient language models

State-of-the-art performance on language understanding tasks is now achieved with increasingly large networks; the current record holder has billions of parameters. Given a language model pre-trained on massive unlabeled text corpora, only…

Computation and Language · Computer Science 2020-04-30 Evani Radiya-Dixit, Xin Wang

Learning Deep Transformer Models for Machine Translation

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto…

Computation and Language · Computer Science 2019-06-06 Qiang Wang, Bei Li, Tong Xiao, Jingbo Zhu, Changliang Li, Derek F. Wong, Lidia S. Chao

Training Deep Networks from Zero to Hero: avoiding pitfalls and going beyond

Training deep neural networks may be challenging in real world data. Using models as black-boxes, even with transfer learning, can result in poor generalization or inconclusive results when it comes to small datasets or specific…

Machine Learning · Computer Science 2021-10-14 Moacir Antonelli Ponti, Fernando Pereira dos Santos, Leo Sampaio Ferraz Ribeiro, Gabriel Biscaro Cavallari

Deep Progressive Training: scaling up depth capacity of zero/one-layer models

Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, an effective strategy is the progressive training, which scales up…

Machine Learning · Computer Science 2025-11-10 Zhiqi Bu

Depth-Adaptive Transformer

State of the art sequence-to-sequence models for large scale tasks perform a fixed number of computations for each input sequence regardless of whether it is easy or hard to process. In this paper, we train Transformer models which can make…

Computation and Language · Computer Science 2020-02-18 Maha Elbayad, Jiatao Gu, Edouard Grave, Michael Auli

Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring

Understanding whether deep neural networks are effectively optimized remains challenging, as training occurs in highly nonconvex landscapes and standard metrics provide limited visibility into layer-wise learning quality. This challenge is…

Machine Learning · Computer Science 2026-05-05 Arian Eamaz, Farhang Yeganegi, Mojtaba Soltanalian

Lightweight Transformers for Zero-Shot and Fine-Tuned Text-to-SQL Generation Using Spider

Text-to-SQL translation enables non-expert users to query relational databases using natural language, with applications in education and business intelligence. This study evaluates three lightweight transformer models - T5-Small,…

Computation and Language · Computer Science 2025-08-07 Chirag Seth, Utkarsh Singh

Leaner Transformers: More Heads, Less Depth

Transformers have reshaped machine learning by utilizing attention mechanisms to capture complex patterns in large datasets, leading to significant improvements in performance. This success has contributed to the belief that "bigger means…

Machine Learning · Computer Science 2025-05-28 Hemanth Saratchandran, Damien Teney, Simon Lucey

DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction

There is currently a significant gap between the performance of fine-tuned models and prompting approaches using Large Language Models (LLMs) on the challenging task of text-to-SQL, as evaluated on datasets such as Spider. To improve the…

Computation and Language · Computer Science 2023-11-06 Mohammadreza Pourreza, Davood Rafiei

Downstream Datasets Make Surprisingly Good Pretraining Corpora

For most natural language processing tasks, the dominant practice is to finetune large pretrained transformer models (e.g., BERT) using smaller downstream datasets. Despite the success of this approach, it remains unclear to what extent…

Computation and Language · Computer Science 2023-05-29 Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton

Transformers for End-to-End InfoSec Tasks: A Feasibility Study

In this paper, we assess the viability of transformer models in end-to-end InfoSec settings, in which no intermediate feature representations or processing steps occur outside the model. We implement transformer models for two distinct…

Machine Learning · Computer Science 2022-12-07 Ethan M. Rudd, Mohammad Saidur Rahman, Philip Tully

Deep Transformers with Latent Depth

The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks. However, how to leverage model capacity with large or variable depths is still an open challenge. We present a probabilistic framework to…

Computation and Language · Computer Science 2020-10-19 Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

Remote Sensing Change Detection With Transformers Trained from Scratch

Current transformer-based change detection (CD) approaches either employ a pre-trained model trained on large-scale image classification ImageNet dataset or rely on first pre-training on another CD dataset and then fine-tuning on the target…

Computer Vision and Pattern Recognition · Computer Science 2023-04-14 Mubashir Noman, Mustansar Fiaz, Hisham Cholakkal, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Fahad Shahbaz Khan

Mimetic Initialization of Self-Attention Layers

It is notoriously difficult to train Transformers on small datasets; typically, large pre-trained models are instead used as the starting point. We explore the weights of such pre-trained Transformers (particularly for vision) to attempt to…

Computer Vision and Pattern Recognition · Computer Science 2023-05-18 Asher Trockman, J. Zico Kolter