Related papers: Efficient pre-training objectives for Transformers

Effective Pre-Training Objectives for Transformer-based Autoencoders

In this paper, we study trade-offs between efficiency, cost and accuracy when pre-training Transformer encoders with different pre-training objectives. For this purpose, we analyze features of common objectives and combine them to create…

Computation and Language · Computer Science · 2022-10-26 · Luca Di Liello, Matteo Gabburo, Alessandro Moschitti

Maximizing Efficiency of Language Model Pre-training for Learning Representation

Pre-trained language models in the past years have shown exponential growth in model parameters and compute time. ELECTRA is a novel approach for improving the compute efficiency of pre-trained language models (e.g. BERT) based on masked…

Computation and Language · Computer Science · 2021-10-14 · Junmo Kang, Suwon Shin, Jeonghwan Kim, Jaeyoung Jo, Sung-Hyon Myaeng

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to…

Computation and Language · Computer Science · 2020-03-25 · Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Fast-ELECTRA for Efficient Pre-training

ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model. Although ELECTRA offers a significant boost in efficiency, its potential is constrained by the training cost brought by the…

Computation and Language · Computer Science · 2023-10-12 · Chengyu Dong, Liyuan Liu, Hao Cheng, Jingbo Shang, Jianfeng Gao, Xiaodong Liu

MC-BERT: Efficient Language Pre-Training via a Meta Controller

Pre-trained contextual representations (e.g., BERT) have become the foundation to achieve state-of-the-art results on many NLP tasks. However, large-scale pre-training is computationally expensive. ELECTRA, an early attempt to accelerate…

Computation and Language · Computer Science · 2020-06-17 · Zhenhui Xu, Linyuan Gong, Guolin Ke, Di He, Shuxin Zheng, Liwei Wang, Jiang Bian, Tie-Yan Liu

Training ELECTRA Augmented with Multi-word Selection

Pre-trained text encoders such as BERT and its variants have recently achieved state-of-the-art performances on many NLP tasks. While being effective, these pre-training methods typically demand massive computation resources. To accelerate…

Computation and Language · Computer Science · 2022-03-04 · Jiaming Shen, Jialu Liu, Tianqi Liu, Cong Yu, Jiawei Han

ELECTRA is a Zero-Shot Learner, Too

Recently, for few-shot or even zero-shot learning, the new paradigm "pre-train, prompt, and predict" has achieved remarkable achievements compared with the "pre-train, fine-tune" paradigm. After the success of prompt-based GPT-3, a series…

Computation and Language · Computer Science · 2022-07-21 · Shiwen Ni, Hung-Yu Kao

Token Dropping for Efficient BERT Pretraining

Transformer-based models generally allocate the same amount of computation for each token in a given sequence. We develop a simple but effective "token dropping" method to accelerate the pretraining of transformer models, such as BERT,…

Computation and Language · Computer Science · 2022-03-25 · Le Hou, Richard Yuanzhe Pang, Tianyi Zhou, Yuexin Wu, Xinying Song, Xiaodan Song, Denny Zhou

Early Transformers: A study on Efficient Training of Transformer Models through Early-Bird Lottery Tickets

The training of Transformer models has revolutionized natural language processing and computer vision, but it remains a resource-intensive and time-consuming process. This paper investigates the applicability of the early-bird ticket…

Computation and Language · Computer Science · 2024-05-07 · Shravan Cheekati

Learning to Sample Replacements for ELECTRA Pre-Training

ELECTRA pretrains a discriminator to detect replaced tokens, where the replacements are sampled from a generator trained with masked language modeling. Despite the compelling performance, ELECTRA suffers from the following two issues.…

Computation and Language · Computer Science · 2021-06-28 · Yaru Hao, Li Dong, Hangbo Bao, Ke Xu, Furu Wei

Pre-Training Transformers as Energy-Based Cloze Models

We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution…

Computation and Language · Computer Science · 2020-12-17 · Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream…

Machine Learning · Computer Science · 2023-11-15 · Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner

Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models

Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit…

Computation and Language · Computer Science · 2022-10-28 · Mengzhou Xia, Mikel Artetxe, Jingfei Du, Danqi Chen, Ves Stoyanov

Utilizing Bidirectional Encoder Representations from Transformers for Answer Selection

Pre-training a transformer-based model for the language modeling task in a large dataset and then fine-tuning it for downstream tasks has been found very useful in recent years. One major advantage of such pre-trained language models is…

Computation and Language · Computer Science · 2020-11-17 · Md Tahmid Rahman Laskar, Enamul Hoque, Jimmy Xiangji Huang

Efficient Fine-Tuning of Compressed Language Models with Learners

Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the…

Computation and Language · Computer Science · 2022-08-04 · Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J. Clark, Brett H. Meyer, Warren J. Gross

CodeArt: Better Code Models by Attention Regularization When Symbols Are Lacking

Transformer based code models have impressive performance in many software engineering tasks. However, their effectiveness degrades when symbols are missing or not informative. The reason is that the model may not learn to pay attention to…

Software Engineering · Computer Science · 2024-11-22 · Zian Su, Xiangzhe Xu, Ziyang Huang, Zhuo Zhang, Yapeng Ye, Jianjun Huang, Xiangyu Zhang

How much pretraining data do language models need to learn syntax?

Transformers-based pretrained language models achieve outstanding results in many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study…

Computation and Language · Computer Science · 2021-09-10 · Laura Pérez-Mayos, Miguel Ballesteros, Leo Wanner

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Attention based language models have become a critical component in state-of-the-art natural language processing systems. However, these models have significant computational requirements, due to long training times, dense operations and…

Computation and Language · Computer Science · 2021-06-11 · Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we…

Computation and Language · Computer Science · 2019-07-29 · Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

ModernBERT or DeBERTaV3? Examining Architecture and Data Influence on Transformer Encoder Models Performance

Pretrained transformer-encoder models like DeBERTaV3 and ModernBERT introduce architectural advancements aimed at improving efficiency and performance. Although the authors of ModernBERT report improved performance over DeBERTaV3 on several…

Computation and Language · Computer Science · 2025-11-17 · Wissam Antoun, Benoît Sagot, Djamé Seddah