Related papers: A Short Study on Compressing Decoder-Based Languag…

Improving Knowledge Distillation for BERT Models: Loss Functions, Mapping Methods, and Weight Tuning

The use of large transformer-based models such as BERT, GPT, and T5 has led to significant advancements in natural language processing. However, these models are computationally expensive, necessitating model compression techniques that…

Computation and Language · Computer Science 2023-08-29 Apoorv Dankar, Adeem Jassani, Kartikaeya Kumar

Exploring Extreme Parameter Compression for Pre-trained Language Models

Recent work explored the potential of large-scale Transformer-based pre-trained models, especially Pre-trained Language Models (PLMs) in natural language processing. This raises many concerns from various perspectives, e.g., financial costs…

Computation and Language · Computer Science 2022-05-23 Yuxin Ren, Benyou Wang, Lifeng Shang, Xin Jiang, Qun Liu

One Teacher is Enough? Pre-trained Language Model Distillation from Multiple Teachers

Pre-trained language models (PLMs) achieve great success in NLP. However, their huge model sizes hinder their applications in many practical systems. Knowledge distillation is a popular technique to compress PLMs, which learns a small…

Computation and Language · Computer Science 2021-06-03 Chuhan Wu, Fangzhao Wu, Yongfeng Huang

Kronecker Decomposition for GPT Compression

GPT is an auto-regressive Transformer-based pre-trained language model which has attracted a lot of attention in the natural language processing (NLP) domain due to its state-of-the-art performance in several downstream tasks. The success…

Computation and Language · Computer Science 2021-10-18 Ali Edalati, Marzieh Tahaei, Ahmad Rashid, Vahid Partovi Nia, James J. Clark, Mehdi Rezagholizadeh

Robustness Challenges in Model Distillation and Pruning for Natural Language Understanding

Recent work has focused on compressing pre-trained language models (PLMs) like BERT where the major focus has been to improve the in-distribution performance for downstream tasks. However, very few of these studies have analyzed the impact…

Computation and Language · Computer Science 2023-02-28 Mengnan Du, Subhabrata Mukherjee, Yu Cheng, Milad Shokouhi, Xia Hu, Ahmed Hassan Awadallah

Well-Read Students Learn Better: On the Importance of Pre-training Compact Models

Recent developments in natural language representations have been accompanied by large and expensive models that leverage vast amounts of general-domain text through self-supervised pre-training. Due to the cost of applying such models to…

Computation and Language · Computer Science 2019-09-27 Iulia Turc, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Patient Knowledge Distillation for BERT Model Compression

Pre-trained language models such as BERT have proven to be highly effective for natural language processing (NLP) tasks. However, the high demand for computing resources in training such models hinders their application in practice. In…

Computation and Language · Computer Science 2019-08-27 Siqi Sun, Yu Cheng, Zhe Gan, Jingjing Liu

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and, thus, are too resource-hungry and…

Machine Learning · Computer Science 2021-09-29 Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett

A Survey on Model Compression for Large Language Models

Large Language Models (LLMs) have transformed natural language processing tasks successfully. Yet, their large size and high computational needs pose challenges for practical use, especially in resource-limited settings. Model compression…

Computation and Language · Computer Science 2024-07-31 Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang

Compression of Generative Pre-trained Language Models via Quantization

The increasing size of generative Pre-trained Language Models (PLMs) has greatly increased the demand for model compression. Despite various methods to compress BERT or its variants, there are few attempts to compress generative PLMs, and…

Computation and Language · Computer Science 2022-07-19 Chaofan Tao, Lu Hou, Wei Zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, Ngai Wong

D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss…

Computation and Language · Computer Science 2024-06-26 Zihan Liao, Hang Yu, Jianguo Li, Jun Wang, Wei Zhang

KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation

The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power,…

Computation and Language · Computer Science 2021-09-15 Marzieh S. Tahaei, Ella Charlaix, Vahid Partovi Nia, Ali Ghodsi, Mehdi Rezagholizadeh

Compression of Deep Learning Models for Text: A Survey

In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanks to deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs) and Long…

Computation and Language · Computer Science 2021-06-15 Manish Gupta, Puneet Agrawal

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains…

Computation and Language · Computer Science 2020-03-03 Victor Sanh, Lysandre Debut, Julien Chaumond, Thomas Wolf

Extremely Small BERT Models from Mixed-Vocabulary Training

Pretrained language models like BERT have achieved good results on NLP tasks, but are impractical on resource-limited devices due to memory footprint. A large fraction of this footprint comes from the input embeddings with large input…

Computation and Language · Computer Science 2021-02-09 Sanqiang Zhao, Raghav Gupta, Yang Song, Denny Zhou

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges…

Computation and Language · Computer Science 2020-04-07 Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation

The rapid advancement of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Miao Rang, Zhenni Bi, Hang Zhou, Hanting Chen, An Xiao, Tianyu Guo, Kai Han, Xinghao Chen, Yunhe Wang

Distilling HuBERT with LSTMs via Decoupled Knowledge Distillation

Much research effort is being applied to the task of compressing the knowledge of self-supervised models, which are powerful, yet large and memory consuming. In this work, we show that the original method of knowledge distillation (and its…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-19 Danilo de Oliveira, Timo Gerkmann

Differentially Private Model Compression

Recent papers have shown that large pre-trained language models (LLMs) such as BERT, GPT-2 can be fine-tuned on private data to achieve performance comparable to non-private models for many downstream Natural Language Processing (NLP) tasks…

Machine Learning · Computer Science 2022-06-07 Fatemehsadat Mireshghallah, Arturs Backurs, Huseyin A Inan, Lukas Wutschitz, Janardhan Kulkarni

Iterative Layer-wise Distillation for Efficient Compression of Large Language Models

This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective…

Computation and Language · Computer Science 2025-11-10 Grigory Kovalev, Mikhail Tikhomirov