Related papers: Structure-Level Knowledge Distillation For Multili…

200 papers

Pre-trained multilingual language models play an important role in cross-lingual natural language understanding tasks. However, existing methods did not focus on learning the semantic structure of representation, and thus could not optimize…

Computation and Language · Computer Science 2022-11-03 Mingqi Li, Fei Ding, Dan Zhang, Long Cheng, Hongxin Hu, Feng Luo

Multilingual models have been widely used for cross-lingual transfer to low-resource languages. However, the performance on these languages is hindered by their underrepresentation in the pretraining data. To alleviate this problem, we…

Computation and Language · Computer Science 2023-05-29 Tomasz Limisiewicz, Dan Malkin, Gabriel Stanovsky

This paper explores sequence-level knowledge distillation (KD) of multilingual pre-trained encoder-decoder translation models. We argue that the teacher model's output distribution holds valuable insights for the student, beyond the…

A recent trend in Natural Language Processing is the exponential growth in Language Model (LM) size, which prevents research groups without a necessary hardware infrastructure from participating in the development process. This study…

Computation and Language · Computer Science 2023-01-31 Jan Philip Wahle

Knowledge distillation (KD) is an effective model compression method that can transfer the internal capabilities of large language models (LLMs) to smaller ones. However, the multi-modal probability distribution predicted by teacher LLMs…

Computation and Language · Computer Science 2024-12-19 Tianyu Peng, Jiajun Zhang

Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models,…

Computation and Language · Computer Science 2025-01-22 Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li

Structured prediction models aim at solving a type of problem where the output is a complex structure, rather than a single variable. Performing knowledge distillation for such models is not trivial due to their exponentially large output…

Machine Learning · Computer Science 2022-03-10 Wenye Lin, Yangming Li, Lemao Liu, Shuming Shi, Hai-tao Zheng

Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning.…

Machine Learning · Computer Science 2025-06-02 Penghui Yang, Ming-Kun Xie, Chen-Chen Zong, Lei Feng, Gang Niu, Masashi Sugiyama, Sheng-Jun Huang

We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows to create multilingual versions from previously monolingual models. The training is based on the idea that a translated…

Computation and Language · Computer Science 2020-10-06 Nils Reimers, Iryna Gurevych

Providing technologies to communities or domains where training data is scarce or protected e.g., for privacy reasons, is becoming increasingly important. To that end, we generalise methods for unsupervised transfer from multiple input…

Computation and Language · Computer Science 2021-10-11 Kemal Kurniawan, Lea Frermann, Philip Schulz, Trevor Cohn

Probing the multilingual knowledge of linguistic structure in LLMs, often characterized as sequence labeling, faces challenges with maintaining output templates in current text-to-text prompting strategies. To solve this, we introduce a…

Computation and Language · Computer Science 2025-11-07 Ercong Nie, Shuzhou Yuan, Bolei Ma, Helmut Schmid, Michael Färber, Frauke Kreuter, Hinrich Schütze

Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time, limit the deployment of such models in real-time applications.…

Computation and Language · Computer Science 2022-11-03 Haojie Pan, Chengyu Wang, Minghui Qiu, Yichang Zhang, Yaliang Li, Jun Huang

We study semi-supervised sequence generation tasks, where the few labeled examples are too scarce to finetune a model, and meanwhile, few-shot prompted large language models (LLMs) exhibit room for improvement. In this paper, we present the…

Computation and Language · Computer Science 2024-08-06 Jiachen Zhao, Wenlong Zhao, Andrew Drozdov, Benjamin Rozonoyer, Md Arafat Sultan, Jay-Yoon Lee, Mohit Iyyer, Andrew McCallum

Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language…

Computation and Language · Computer Science 2017-09-18 Nikolaos Pappas, Andrei Popescu-Belis

Knowledge distillation, transferring knowledge from a teacher model to a student model, has emerged as a powerful technique in neural machine translation for compressing models or simplifying training targets. Knowledge distillation…

Computation and Language · Computer Science 2024-04-24 Jingxuan Wei, Linzhuang Sun, Yichong Leng, Xu Tan, Bihui Yu, Ruifeng Guo

In this paper, we propose Stochastic Knowledge Distillation (SKD) to obtain compact BERT-style language model dubbed SKDBERT. In each iteration, SKD samples a teacher model from a pre-defined teacher ensemble, which consists of multiple…

Computation and Language · Computer Science 2022-11-30 Zixiang Ding, Guoqing Jiang, Shuai Zhang, Lin Guo, Wei Lin

Current knowledge distillation (KD) methods for semantic segmentation focus on guiding the student to imitate the teacher's knowledge within homogeneous architectures. However, these methods overlook the diverse knowledge contained in…

Machine Learning · Computer Science 2025-04-11 Yanglin Huang, Kai Hu, Yuan Zhang, Zhineng Chen, Xieping Gao

Crossmodal knowledge distillation (KD) aims to enhance a unimodal student using a multimodal teacher model. In particular, when the teacher's modalities include the student's, additional complementary information can be exploited to improve…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Chenqi Guo, Mengshuo Rong, Qianli Feng, Rongfan Feng, Yinglong Ma

Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders…

Computation and Language · Computer Science 2019-06-18 Adhiguna Kuncoro, Chris Dyer, Laura Rimell, Stephen Clark, Phil Blunsom

Knowledge Distillation (KD) is a well-known training paradigm in deep neural networks where knowledge acquired by a large teacher model is transferred to a small student. KD has proven to be an effective technique to significantly improve…

Computer Vision and Pattern Recognition · Computer Science 2022-11-24 Philip de Rijk, Lukas Schneider, Marius Cordts, Dariu M. Gavrila