Related papers: Token Dropping for Efficient BERT Pretraining

Revisiting Token Dropping Strategy in Efficient BERT Pretraining

Token dropping is a recently-proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training…

Computation and Language · Computer Science 2023-05-25 Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

Existing pre-trained language models (PLMs) are often computationally expensive in inference, making them impractical in various resource-limited real-world applications. To address this issue, we propose a dynamic token reduction approach…

Computation and Language · Computer Science 2021-05-26 Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a…

Computation and Language · Computer Science 2022-11-22 Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current…

Machine Learning · Computer Science 2020-10-27 Minjia Zhang, Yuxiong He

AdapLeR: Speeding up Inference by Adaptive Length Reduction

Pre-trained language models have shown stellar performance in various downstream tasks. But, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a…

Computation and Language · Computer Science 2022-03-18 Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Learning to Skip for Language Modeling

Overparameterized large-scale language models have impressive generalization performance of in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the…

Computation and Language · Computer Science 2023-11-28 Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts…

Computation and Language · Computer Science 2024-06-04 Jungmin Yun, Mihyeon Kim, Youngbin Kim

Efficient pre-training objectives for Transformers

The Transformer architecture deeply changed the natural language processing, outperforming all previous state-of-the-art models. However, well-known Transformer models like BERT, RoBERTa, and GPT-2 require a huge compute budget to create a…

Computation and Language · Computer Science 2021-04-21 Luca Di Liello, Matteo Gabburo, Alessandro Moschitti

Token Drop mechanism for Neural Machine Translation

Neural machine translation with millions of parameters is vulnerable to unfamiliar inputs. We propose Token Drop to improve generalization and avoid overfitting for the NMT model. Similar to word dropout, whereas we replace dropped token…

Computation and Language · Computer Science 2020-10-22 Huaao Zhang, Shigui Qiu, Xiangyu Duan, Min Zhang

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to…

Computation and Language · Computer Science 2020-03-25 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Accelerating BERT Inference for Sequence Labeling via Early-Exit

Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although the pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their…

Computation and Language · Computer Science 2021-06-15 Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and…

Computation and Language · Computer Science 2022-11-16 Baohao Liao, David Thulke, Sanjika Hewavitharana, Hermann Ney, Christof Monz

Efficient Pre-Training with Token Superposition

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training…

Computation and Language · Computer Science 2026-05-20 Bowen Peng, Théo Gigant, Jeffrey Quesnelle

Robust Transfer Learning with Pretrained Language Models through Adapters

Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific…

Computation and Language · Computer Science 2021-08-06 Wenjuan Han, Bo Pang, Yingnian Wu

Weighted Sampling for Masked Language Modeling

Masked Language Modeling (MLM) is widely used to pretrain language models. The standard random masking strategy in MLM causes the pre-trained language models (PLMs) to be biased toward high-frequency tokens. Representation learning of rare…

Computation and Language · Computer Science 2023-05-25 Linhan Zhang, Qian Chen, Wen Wang, Chong Deng, Xin Cao, Kongzhang Hao, Yuxin Jiang, Wei Wang

Frustratingly Simple Pretraining Alternatives to Masked Language Modeling

Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced…

Computation and Language · Computer Science 2021-09-07 Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina, Nikolaos Aletras

BudgetLongformer: Can we Cheaply Pretrain a SotA Legal Language Model From Scratch?

Pretrained transformer models have achieved state-of-the-art results in many tasks and benchmarks recently. Many state-of-the-art Language Models (LMs), however, do not scale well above the threshold of 512 input tokens. In specialized…

Computation and Language · Computer Science 2022-12-01 Joel Niklaus, Daniele Giofré

No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models

The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream…

Machine Learning · Computer Science 2023-11-15 Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner

A Comprehensive Comparison of Pre-training Language Models

Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to the new state-of-the-art. In this paper we explore the efficiency of various pre-trained language models. We pre-train a list of…

Computation and Language · Computer Science 2023-07-27 Tong Guo

Learned Token Pruning for Transformers

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer