Related papers: Token Dropping for Efficient BERT Pretraining

200 papers

Token dropping is a recently-proposed strategy to speed up the pretraining of masked language models, such as BERT, by skipping the computation of a subset of the input tokens at several middle layers. It can effectively reduce the training…

Computation and Language · Computer Science 2023-05-25 Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao

Existing pre-trained language models (PLMs) are often computationally expensive in inference, making them impractical in various resource-limited real-world applications. To address this issue, we propose a dynamic token reduction approach…

Computation and Language · Computer Science 2021-05-26 Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun

Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a…

Computation and Language · Computer Science 2022-11-22 Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current…

Machine Learning · Computer Science 2020-10-27 Minjia Zhang, Yuxiong He

Pre-trained language models have shown stellar performance in various downstream tasks. But, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a…

Computation and Language · Computer Science 2022-03-18 Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Overparameterized large-scale language models have impressive generalization performance of in-context few-shot learning. However, most language models allocate the same amount of parameters or computation to each token, disregarding the…

Computation and Language · Computer Science 2023-11-28 Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen, Claire Cui

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts…

Computation and Language · Computer Science 2024-06-04 Jungmin Yun, Mihyeon Kim, Youngbin Kim

The Transformer architecture deeply changed the natural language processing, outperforming all previous state-of-the-art models. However, well-known Transformer models like BERT, RoBERTa, and GPT-2 require a huge compute budget to create a…

Computation and Language · Computer Science 2021-04-21 Luca Di Liello, Matteo Gabburo, Alessandro Moschitti

Neural machine translation with millions of parameters is vulnerable to unfamiliar inputs. We propose Token Drop to improve generalization and avoid overfitting for the NMT model. Similar to word dropout, whereas we replace dropped token…

Computation and Language · Computer Science 2020-10-22 Huaao Zhang, Shigui Qiu, Xiangyu Duan, Min Zhang

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to…

Computation and Language · Computer Science 2020-03-25 Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Both performance and efficiency are crucial factors for sequence labeling tasks in many real-world scenarios. Although the pre-trained models (PTMs) have significantly improved the performance of various sequence labeling tasks, their…

Computation and Language · Computer Science 2021-06-15 Xiaonan Li, Yunfan Shao, Tianxiang Sun, Hang Yan, Xipeng Qiu, Xuanjing Huang

The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and…

Computation and Language · Computer Science 2022-11-16 Baohao Liao, David Thulke, Sanjika Hewavitharana, Hermann Ney, Christof Monz

Pre-training of Large Language Models is often prohibitively expensive and inefficient at scale, requiring complex and invasive modifications in order to achieve high data throughput. In this work, we present Token-Superposition Training…

Computation and Language · Computer Science 2026-05-20 Bowen Peng, Théo Gigant, Jeffrey Quesnelle

Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific…

Computation and Language · Computer Science 2021-08-06 Wenjuan Han, Bo Pang, Yingnian Wu

Masked Language Modeling (MLM) is widely used to pretrain language models. The standard random masking strategy in MLM causes the pre-trained language models (PLMs) to be biased toward high-frequency tokens. Representation learning of rare…

Computation and Language · Computer Science 2023-05-25 Linhan Zhang, Qian Chen, Wen Wang, Chong Deng, Xin Cao, Kongzhang Hao, Yuxin Jiang, Wei Wang

Masked language modeling (MLM), a self-supervised pretraining objective, is widely used in natural language processing for learning text representations. MLM trains a model to predict a random sample of input tokens that have been replaced…

Computation and Language · Computer Science 2021-09-07 Atsuki Yamaguchi, George Chrysostomou, Katerina Margatina, Nikolaos Aletras

Pretrained transformer models have achieved state-of-the-art results in many tasks and benchmarks recently. Many state-of-the-art Language Models (LMs), however, do not scale well above the threshold of 512 input tokens. In specialized…

Computation and Language · Computer Science 2022-12-01 Joel Niklaus, Daniele Giofré

The computation necessary for training Transformer-based language models has skyrocketed in recent years. This trend has motivated research on efficient training algorithms designed to improve training, validation, and downstream…

Machine Learning · Computer Science 2023-11-15 Jean Kaddour, Oscar Key, Piotr Nawrot, Pasquale Minervini, Matt J. Kusner

Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to the new state-of-the-art. In this paper we explore the efficiency of various pre-trained language models. We pre-train a list of…

Computation and Language · Computer Science 2023-07-27 Tong Guo

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer