Related papers: Scalable Syntax-Aware Language Models Using Knowle…

A Cohesive Distillation Architecture for Neural Language Models

A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without a necessary hardware infrastructure from participating in the development process. This study…

Computation and Language · Computer Science 2023-01-31 Jan Philip Wahle

Memorization Dynamics in Knowledge Distillation for Language Models

Knowledge Distillation (KD) is increasingly adopted to transfer capabilities from large language models to smaller ones, offering significant improvements in efficiency and utility while often surpassing standard fine-tuning. Beyond…

Computation and Language · Computer Science 2026-01-23 Jaydeep Borkar, Karan Chadha, Niloofar Mireshghallah, Yuchen Zhang, Irina-Elena Veliche, Archi Mitra, David A. Smith, Zheng Xu, Diego Garcia-Olano

Syntactic Structure Distillation Pretraining For Bidirectional Encoders

Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Given this success, it remains an…

Computation and Language · Computer Science 2020-05-28 Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, Phil Blunsom

Feature Alignment and Representation Transfer in Knowledge Distillation for Large Language Models

Knowledge distillation (KD) is a technique for transferring knowledge from complex teacher models to simpler student models, significantly enhancing model efficiency and accuracy. It has demonstrated substantial advancements in various…

Computation and Language · Computer Science 2025-04-21 Junjie Yang, Junhao Song, Xudong Han, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Yichao Zhang, Qian Niu, Benji Peng, Keyu Chen, Ming Liu

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

Pretrained language models have led to significant performance gains in many NLP tasks. However, the intensive computing resources to train such models remain an issue. Knowledge distillation alleviates this problem by learning a…

Computation and Language · Computer Science 2020-05-04 Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong

On the Generalization vs Fidelity Paradox in Knowledge Distillation

Knowledge distillation (KD) is a key technique for compressing large language models into smaller ones while preserving performance. Despite the recent traction of KD research, its effectiveness for smaller language models (LMs) and the…

Computation and Language · Computer Science 2025-08-05 Suhas Kamasetty Ramesh, Ayan Sengupta, Tanmoy Chakraborty

Structural Knowledge Distillation for Object Detection

Knowledge Distillation (KD) is a well-known training paradigm in deep neural networks where knowledge acquired by a large teacher model is transferred to a small student. KD has proven to be an effective technique to significantly improve…

Computer Vision and Pattern Recognition · Computer Science 2022-11-24 Philip de Rijk, Lukas Schneider, Marius Cordts, Dariu M. Gavrila

A New Training Framework for Deep Neural Network

Knowledge distillation is the process of transferring the knowledge from a large model to a small model. In this process, the small model learns the generalization ability of the large model and retains the performance close to that of the…

Machine Learning · Computer Science 2021-03-26 Zhenyan Hou, Wenxuan Fan

Knowledge Distillation Performs Partial Variance Reduction

Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread…

Machine Learning · Computer Science 2023-12-12 Mher Safaryan, Alexandra Peste, Dan Alistarh

Sequence-Level Knowledge Distillation for Class-Incremental End-to-End Spoken Language Understanding

The ability to learn new concepts sequentially is a major weakness for modern neural networks, which hinders their use in non-stationary environments. Their propensity to fit the current data distribution to the detriment of the past…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-02 Umberto Cappellazzo, Muqiao Yang, Daniele Falavigna, Alessio Brutti

An Investigation of the Combination of Rehearsal and Knowledge Distillation in Continual Learning for Spoken Language Understanding

Continual learning refers to a dynamical framework in which a model receives a stream of non-stationary data over time and must adapt to new data while preserving previously acquired knowledge. Unluckily, neural networks fail to meet these…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-24 Umberto Cappellazzo, Daniele Falavigna, Alessio Brutti

DistiLLM: Towards Streamlined Distillation for Large Language Models

Knowledge distillation (KD) is widely used for compressing a teacher model to a smaller student model, reducing its inference cost and memory footprint while preserving model capabilities. However, current KD methods for auto-regressive…

Computation and Language · Computer Science 2024-07-04 Jongwoo Ko, Sungnyun Kim, Tianyi Chen, Se-Young Yun

Explaining Sequence-Level Knowledge Distillation as Data-Augmentation for Neural Machine Translation

Sequence-level knowledge distillation (SLKD) is a model compression technique that leverages large, accurate teacher models to train smaller, under-parameterized student models. Why does pre-processing MT data with SLKD help us train…

Computation and Language · Computer Science 2019-12-10 Mitchell A. Gordon, Kevin Duh

Talking Models: Distill Pre-trained Knowledge to Downstream Models via Interactive Communication

Many recent breakthroughs in machine learning have been enabled by the pre-trained foundation models. By scaling up model parameters, training data, and computation resources, foundation models have significantly advanced the…

Artificial Intelligence · Computer Science 2023-10-06 Zhe Zhao, Qingyun Liu, Huan Gui, Bang An, Lichan Hong, Ed H. Chi

Does Knowledge Distillation Matter for Large Language Model based Bundle Generation?

LLMs are increasingly explored for bundle generation, thanks to their reasoning capabilities and knowledge. However, deploying large-scale LLMs introduces significant efficiency challenges, primarily high computational costs during…

Computation and Language · Computer Science 2025-04-25 Kaidong Feng, Zhu Sun, Jie Yang, Hui Fang, Xinghua Qu, Wenyuan Liu

Knowledge Distillation with Structured Chain-of-Thought for Text-to-SQL

Deploying accurate Text-to-SQL systems at the enterprise level faces a difficult trilemma involving cost, security and performance. Current solutions force enterprises to choose between expensive, proprietary Large Language Models (LLMs)…

Computation and Language · Computer Science 2026-03-13 Khushboo Thaker, Yony Bresler

From Teacher to Student: Tracking Memorization Through Model Distillation

Large language models (LLMs) are known to memorize parts of their training data, raising important concerns around privacy and security. While previous research has focused on studying memorization in pre-trained models, much less is known…

Machine Learning · Computer Science 2025-08-19 Simardeep Singh

Dynamic Knowledge Distillation for Pre-trained Language Models

Knowledge distillation (KD) has been proved effective for compressing large-scale pre-trained language models. However, existing methods conduct KD statically, e.g., the student model aligns its output distribution to that of a selected…

Computation and Language · Computer Science 2021-09-24 Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun

DDK: Distilling Domain Knowledge for Efficient Large Language Models

Despite the advanced intelligence abilities of large language models (LLMs) in various applications, they still face significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve…

Computation and Language · Computer Science 2024-07-24 Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng

An Empirical Study of Leveraging Knowledge Distillation for Compressing Multilingual Neural Machine Translation Models

Knowledge distillation (KD) is a well-known method for compressing neural models. However, works focusing on distilling knowledge from large multilingual neural machine translation (MNMT) models into smaller ones are practically…

Computation and Language · Computer Science 2023-04-20 Varun Gumma, Raj Dabre, Pratyush Kumar