Related papers: Structure-Level Knowledge Distillation For Multili…

Multi-level Distillation of Semantic Knowledge for Pre-training Multilingual Language Model

Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods did not focus on learning the semantic structure of representation, and thus could not optimize…

Computation and Language · Computer Science 2022-11-03 Mingqi Li, Fei Ding, Dan Zhang, Long Cheng, Hongxin Hu, Feng Luo

You Can Have Your Data and Balance It Too: Towards Balanced and Efficient Multilingual Models

Multilingual models have been widely used for cross-lingual transfer to low-resource languages. However, the performance on these languages is hindered by their underrepresentation in the pretraining data. To alleviate this problem, we…

Computation and Language · Computer Science 2023-05-29 Tomasz Limisiewicz, Dan Malkin, Gabriel Stanovsky

Multi-Hypothesis Distillation of Multilingual Neural Translation Models for Low-Resource Languages

This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the…

Computation and Language · Computer Science 2025-08-01 Aarón Galiano-Jiménez, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena

A Cohesive Distillation Architecture for Neural Language Models

A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without a necessary hardware infrastructure from participating in the development process. This study…

Computation and Language · Computer Science 2023-01-31 Jan Philip Wahle

Enhancing Knowledge Distillation of Large Language Models through Efficient Multi-Modal Distribution Alignment

Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs…

Computation and Language · Computer Science 2024-12-19 Tianyu Peng, Jiajun Zhang

Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models

Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models,…

Computation and Language · Computer Science 2025-01-22 Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li

Efficient Sub-structured Knowledge Distillation

Structured prediction models aim at solving a type of problem where the output is a complex structure, rather than a single variable. Performing knowledge distillation for such models is not trivial due to their exponentially large output…

Machine Learning · Computer Science 2022-03-10 Wenye Lin, Yangming Li, Lemao Liu, Shuming Shi, Hai-tao Zheng

Multi-Label Knowledge Distillation

Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning.…

Machine Learning · Computer Science 2025-06-02 Penghui Yang, Ming-Kun Xie, Chen-Chen Zong, Lei Feng, Gang Niu, Masashi Sugiyama, Sheng-Jun Huang

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

We present an easy and efficient method to extend existing sentence embedding models to new languages. This makes it possible to create multilingual versions of previously monolingual models. The training is based on the idea that a translated…

Computation and Language · Computer Science 2020-10-06 Nils Reimers, Iryna Gurevych

Unsupervised Cross-Lingual Transfer of Structured Predictors without Source Data

Providing technologies to communities or domains where training data is scarce or protected, e.g., for privacy reasons, is becoming increasingly important. To that end, we generalise methods for unsupervised transfer from multiple input…

Computation and Language · Computer Science 2021-10-11 Kemal Kurniawan, Lea Frermann, Philip Schulz, Trevor Cohn

Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models

Probing the multilingual knowledge of linguistic structure in LLMs, often characterized as sequence labeling, faces challenges with maintaining output templates in current text-to-text prompting strategies. To solve this, we introduce a…

Computation and Language · Computer Science 2025-11-07 Ercong Nie, Shuzhou Yuan, Bolei Ma, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

Meta-KD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains

Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time, limit the deployment of such models in real-time applications.…

Computation and Language · Computer Science 2022-11-03 Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li, Jun Huang

Multistage Collaborative Knowledge Distillation from a Large Language Model for Semi-Supervised Sequence Generation

We study semi-supervised sequence generation tasks, where the few labeled examples are too scarce to finetune a model, and meanwhile, few-shot prompted large language models (LLMs) exhibit room for improvement. In this paper, we present the…

Computation and Language · Computer Science 2024-08-06 Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, Benjamin Rozonoyer, Md Arafat Sultan, Jay-Yoon Lee, Mohit Iyyer, Andrew McCallum

Multilingual Hierarchical Attention Networks for Document Classification

Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language…

Computation and Language · Computer Science 2017-09-18 Nikolaos Pappas, Andrei Popescu-Belis

Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation

Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation…

Computation and Language · Computer Science 2024-04-24 Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo

SKDBERT: Compressing BERT via Stochastic Knowledge Distillation

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain a compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple…

Computation and Language · Computer Science 2022-11-30 Zixiang Ding, Guoqing Jiang, Shuai Zhang, Lin Guo, Wei Lin

Distilling Knowledge from Heterogeneous Architectures for Semantic Segmentation

Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in…

Machine Learning · Computer Science 2025-04-11 Yanglin Huang, Kai Hu, Yuan Zhang, Zhineng Chen, Xieping Gao

Crossmodal Knowledge Distillation with WordNet-Relaxed Text Embeddings for Robust Image Classification

Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, Yinglong Ma

Scalable Syntax-Aware Language Models Using Knowledge Distillation

Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders…

Computation and Language · Computer Science 2019-06-18 Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, Phil Blunsom

Structural Knowledge Distillation for Object Detection

Knowledge Distillation (KD) is a well-known training paradigm in deep neural networks where knowledge acquired by a large teacher model is transferred to a small student. KD has proven to be an effective technique to significantly improve…

Computer Vision and Pattern Recognition · Computer Science 2022-11-24 Philip de Rijk, Lukas Schneider, Marius Cordts, Dariu M. Gavrila