Quantifying Knowledge Distillation Using Partial Information Decomposition

Machine Learning | 2025-04-07 | v2 | Computer Vision and Pattern Recognition; Information Theory; Machine Learning; Image and Video Processing; math.IT

Abstract

Knowledge distillation makes it possible to deploy complex machine learning models in resource-constrained environments by training a smaller student model to emulate the internal representations of a complex teacher model. However, the teacher's representations can also encode nuisance or additional information that is not relevant to the downstream task. Distilling such irrelevant information can impede the performance of a capacity-limited student model. This observation motivates our primary question: What are the information-theoretic limits of knowledge distillation? To this end, we leverage Partial Information Decomposition to quantify and explain both the knowledge already transferred and the knowledge left to distill for a downstream task. We theoretically demonstrate that the task-relevant transferred knowledge is succinctly captured by the measure of redundant information about the task between the teacher and student. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID). RID leads to more resilient and effective distillation under nuisance teachers because it quantifies task-relevant knowledge rather than simply aligning student and teacher representations.
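
For readers unfamiliar with Partial Information Decomposition, the standard bivariate identities can be sketched as below. The symbols $Y$ (task label), $T$ (teacher representation), and $S$ (student representation), as well as the names Red, Uni, and Syn, are notational assumptions made here for illustration and are not taken from the abstract.

```latex
% Assumed notation: Y = task label, T = teacher representation, S = student representation.
% Standard bivariate PID: joint and marginal mutual information split into
% redundant, unique, and synergistic parts.
\begin{align}
I(Y; T, S) &= \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : T \setminus S)
            + \mathrm{Uni}(Y : S \setminus T) + \mathrm{Syn}(Y : T, S) \\
I(Y; T)    &= \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : T \setminus S) \\
I(Y; S)    &= \mathrm{Red}(Y : T, S) + \mathrm{Uni}(Y : S \setminus T)
\end{align}
```

Under this reading, the redundant term is the task-relevant information shared by teacher and student, which the abstract identifies with the transferred knowledge. Interpreting $\mathrm{Uni}(Y : T \setminus S)$ as the knowledge left to distill is a plausible mapping assumed here rather than something the abstract states.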

Cite

@article{arxiv.2411.07483,
  title   = {Quantifying Knowledge Distillation Using Partial Information Decomposition},
  author  = {Pasan Dissanayake and Faisal Hamman and Barproda Halder and Ilia Sucholutsky and Qiuyi Zhang and Sanghamitra Dutta},
  journal = {arXiv preprint arXiv:2411.07483},
  year    = {2025}
}

Comments

Accepted at the 28th International Conference on Artificial Intelligence and Statistics (AISTATS) 2025
