Related papers: Patient Knowledge Distillation for BERT Model Comp…

200 papers

Knowledge distillation is an effective technique for pre-trained language model compression. Although existing knowledge distillation methods perform well for the most typical model BERT, they could be further improved in two aspects: the…

Computation and Language · Computer Science 2024-07-04 Ying Zhang, Ziheng Yang, Shufan Ji

Pre-trained language models (PLMs) achieve great success in NLP. However, their huge model sizes hinder their applications in many practical systems. Knowledge distillation is a popular technique to compress PLMs, which learns a small…

Computation and Language · Computer Science 2021-06-03 Chuhan Wu, Fangzhao Wu, Yongfeng Huang

Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources to train such models remain an issue. Knowledge distillation alleviates this problem by learning a…

Computation and Language · Computer Science 2020-05-04 Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong

Knowledge distillation is considered as a training and compression strategy in which two neural networks, namely a teacher and a student, are coupled together during training. The teacher network is supposed to be a trustworthy predictor…

Computation and Language · Computer Science 2020-12-29 Peyman Passban, Yimeng Wu, Mehdi Rezagholizadeh, Qun Liu

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple…

Computation and Language · Computer Science 2022-11-30 Zixiang Ding, Guoqing Jiang, Shuai Zhang, Lin Guo, Wei Lin

Knowledge distillation (KD) is one of the prominent techniques for model compression. In this method, the knowledge of a large network (teacher) is distilled into a model (student) with usually significantly fewer parameters. KD tries to…

Machine Learning · Computer Science 2023-01-31 Aref Jafari, Mehdi Rezagholizadeh, Ali Ghodsi

Despite pre-trained language models such as BERT have achieved appealing performance in a wide range of natural language processing tasks, they are computationally expensive to be deployed in real-time applications. A typical method is to…

Computation and Language · Computer Science 2021-06-22 Lingyun Feng, Minghui Qiu, Yaliang Li, Hai-Tao Zheng, Ying Shen

Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time, limit the deployment of such models in real-time applications.…

Computation and Language · Computer Science 2022-11-03 Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li, Jun Huang

In natural language processing (NLP) tasks, slow inference speed and huge footprints in GPU usage remain the bottleneck of applying pre-trained deep models in production. As a popular method for model compression, knowledge distillation…

Computation and Language · Computer Science 2020-12-15 Fei Yuan, Linjun Shou, Jian Pei, Wutao Lin, Ming Gong, Yan Fu, Daxin Jiang

The use of large transformer-based models such as BERT, GPT, and T5 has led to significant advancements in natural language processing. However, these models are computationally expensive, necessitating model compression techniques that…

Computation and Language · Computer Science 2023-08-29 Apoorv Dankar, Adeem Jassani, Kartikaeya Kumar

How can we efficiently compress a model while maintaining its performance? Knowledge Distillation (KD) is one of the widely known methods for model compression. In essence, KD trains a smaller student model based on a larger teacher model…

Machine Learning · Computer Science 2020-12-14 Ikhyun Cho, U Kang

Deep neural networks (DNNs) have improved NLP tasks significantly, but training and maintaining such networks could be costly. Model compression techniques, such as, knowledge distillation (KD), have been proposed to address the issue;…

Computation and Language · Computer Science 2023-11-08 Manas Mohanty, Tanya Roosta, Peyman Passban

Knowledge distillation (KD) is an effective model compression technique where a compact student network is taught to mimic the behavior of a complex and highly trained teacher network. In contrast, Mutual Learning (ML) provides an…

Computer Vision and Pattern Recognition · Computer Science 2021-10-25 Usma Niyaz, Deepti R. Bathula

Pretrained language models like BERT have achieved good results on NLP tasks, but are impractical on resource-limited devices due to memory footprint. A large fraction of this footprint comes from the input embeddings with large input…

Computation and Language · Computer Science 2021-02-09 Sanqiang Zhao, Raghav Gupta, Yang Song, Denny Zhou

Pre-trained language models (e.g., BERT) have achieved significant success in various natural language processing (NLP) tasks. However, high storage and computational costs obstruct pre-trained language models to be effectively deployed on…

Computation and Language · Computer Science 2020-10-14 Jianquan Li, Xiaokang Liu, Honghong Zhao, Ruifeng Xu, Min Yang, Yaohong Jin

Knowledge distillation has attracted a great deal of interest recently to compress pre-trained language models. However, existing knowledge distillation methods suffer from two limitations. First, the student model simply imitates the…

Computation and Language · Computer Science 2023-05-18 Siyue Wu, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Rui Wang

Knowledge distillation aims to compress a powerful yet cumbersome teacher model into a lightweight student model without much sacrifice of performance. For this purpose, various approaches have been proposed over the past few years,…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Defang Chen, Jian-Ping Mei, Hailin Zhang, Can Wang, Yan Feng, Chun Chen

Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher). Various…

Machine Learning · Computer Science 2020-07-08 Fahad Sarfraz, Elahe Arani, Bahram Zonooz

Significant memory and computational requirements of large deep neural networks restrict their application on edge devices. Knowledge distillation (KD) is a prominent model compression technique for deep neural networks in which the…

Computation and Language · Computer Science 2021-04-16 Aref Jafari, Mehdi Rezagholizadeh, Pranav Sharma, Ali Ghodsi

This paper studies compressing pre-trained language models, like BERT (Devlin et al., 2019), via teacher-student knowledge distillation. Previous works usually force the student model to strictly mimic the smoothed labels predicted by the…

Computation and Language · Computer Science 2020-05-11 Xing Wu, Yibing Liu, Xiangyang Zhou, Dianhai Yu