Related papers: Progressively Stacking 2.0: A Multi-stage Layerwis…

Boosting Distributed Training Performance of the Unpadded BERT Model

Pre-training models are an important tool in Natural Language Processing (NLP), while the BERT model is a classic pre-training model whose structure has been widely adopted by followers. It was even chosen as the reference model for the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-18 Jinle Zeng, Min Li, Zhihua Wu, Jiaqi Liu, Yuang Liu, Dianhai Yu, Yanjun Ma

MML: Maximal Multiverse Learning for Robust Fine-Tuning of Language Models

Recent state-of-the-art language models utilize a two-phase training procedure comprised of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. More recently, many studies have been focused…

Computation and Language · Computer Science 2019-11-15 Itzik Malkiel, Lior Wolf

bert2BERT: Towards Reusable Pretrained Language Models

In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources and most of the models are trained from…

Computation and Language · Computer Science 2021-10-15 Cheng Chen, Yichun Yin, Lifeng Shang, Xin Jiang, Yujia Qin, Fengyu Wang, Zhi Wang, Xiao Chen, Zhiyuan Liu, Qun Liu

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current…

Machine Learning · Computer Science 2020-10-27 Minjia Zhang, Yuxiong He

Visualizing and Understanding the Effectiveness of BERT

Language model pre-training, such as BERT, has achieved remarkable results in many NLP tasks. However, it is unclear why the pre-training-then-fine-tuning paradigm can improve performance and generalization capability across different…

Computation and Language · Computer Science 2019-08-16 Yaru Hao, Li Dong, Furu Wei, Ke Xu

Hierarchical Multitask Learning Approach for BERT

Recent works show that learning contextualized embeddings for words is beneficial for downstream tasks. BERT is one successful example of this approach. It learns embeddings by solving two tasks, which are masked language model (masked LM)…

Computation and Language · Computer Science 2020-11-10 Çağla Aksoy, Alper Ahmetoğlu, Tunga Güngör

Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of…

Computation and Language · Computer Science 2026-02-06 Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie

Breaking MLPerf Training: A Case Study on Optimizing BERT

Speeding up the large-scale distributed training is challenging in that it requires improving various components of training including load balancing, communication, optimizers, etc. We present novel approaches for fast large-scale training…

Machine Learning · Computer Science 2024-02-06 Yongdeok Kim, Jaehyung Ahn, Myeongwoo Kim, Changin Choi, Heejae Kim, Narankhuu Tuvshinjargal, Seungwon Lee, Yanzi Zhang, Yuan Pei, Xiongzhan Linghu, Jingkun Ma, Lin Chen, Yuehua Dai, Sungjoo Yoo

On the Effectiveness of Incremental Training of Large Language Models

Training large language models is a computationally intensive process that often requires substantial resources to achieve state-of-the-art results. Incremental layer-wise training has been proposed as a potential strategy to optimize the…

Computation and Language · Computer Science 2024-12-02 Miles Q. Li, Benjamin C. M. Fung, Shih-Chia Huang

DPBERT: Efficient Inference for BERT based on Dynamic Planning

Large-scale pre-trained language models such as BERT have contributed significantly to the development of NLP. However, those models require large computational resources, making it difficult to be applied to mobile devices where computing…

Computation and Language · Computer Science 2023-08-02 Weixin Wu, Hankz Hankui Zhuo

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

Existing pre-trained language models (PLMs) are often computationally expensive in inference, making them impractical in various resource-limited real-world applications. To address this issue, we propose a dynamic token reduction approach…

Computation and Language · Computer Science 2021-05-26 Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science 2020-02-11 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes

BERT has recently attracted a lot of attention in natural language understanding (NLU) and achieved state-of-the-art results in various NLU tasks. However, its success requires large deep neural networks and huge amount of data, which…

Machine Learning · Computer Science 2020-09-21 Shuai Zheng, Haibin Lin, Sheng Zha, Mu Li

CoRe: An Efficient Coarse-refined Training Framework for BERT

In recent years, BERT has made significant breakthroughs on many natural language processing tasks and attracted great attentions. Despite its accuracy gains, the BERT model generally involves a huge number of parameters and needs to be…

Computation and Language · Computer Science 2021-02-19 Cheng Yang, Shengnan Wang, Yuechuan Li, Chao Yang, Ming Yan, Jingqiao Zhang, Fangquan Lin

On the Transformer Growth for Progressive BERT Training

Due to the excessive cost of large-scale language model pre-training, considerable efforts have been made to train BERT progressively -- start from an inferior but low-cost model and gradually grow the model to increase the computational…

Computation and Language · Computer Science 2021-07-13 Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, Jiawei Han

Distilling Knowledge Learned in BERT for Text Generation

Large-scale pre-trained language model such as BERT has achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach,…

Computation and Language · Computer Science 2020-07-21 Yen-Chun Chen, Zhe Gan, Yu Cheng, Jingzhou Liu, Jingjing Liu

BoostingBERT: Integrating Multi-Class Boosting into BERT for NLP Tasks

As a pre-trained Transformer model, BERT (Bidirectional Encoder Representations from Transformers) has achieved ground-breaking performance on multiple NLP tasks. On the other hand, Boosting is a popular ensemble learning technique which…

Computation and Language · Computer Science 2020-09-15 Tongwen Huang, Qingyun She, Junlin Zhang

A Multi-Level Framework for Accelerating Training Transformer Models

The fast growing capabilities of large-scale deep learning models, such as Bert, GPT and ViT, are revolutionizing the landscape of NLP, CV and many other domains. Training such models, however, poses an unprecedented demand for computing…

Machine Learning · Computer Science 2024-04-15 Longwei Zou, Han Zhang, Yangdong Deng

Multi-stage Pre-training over Simplified Multimodal Pre-training Models

Multimodal pre-training models, such as LXMERT, have achieved excellent results in downstream tasks. However, current pre-trained models require large amounts of training data and have huge model sizes, which make them difficult to apply in…

Computation and Language · Computer Science 2021-08-02 Tongtong Liu, Fangxiang Feng, Xiaojie Wang

Speeding up Deep Model Training by Sharing Weights and Then Unsharing

We propose a simple and efficient approach for training the BERT model. Our approach exploits the special structure of BERT that contains a stack of repeated modules (i.e., transformer encoders). Our proposed approach first trains BERT with…

Machine Learning · Computer Science 2021-10-11 Shuo Yang, Le Hou, Xiaodan Song, Qiang Liu, Denny Zhou