State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

cs.MM · 2026-05 · v1

Abstract

Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.
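As a rough illustration of the anchoring idea, the sketch below combines a prediction-level term (KL divergence between temperature-softened teacher and student distributions) with a state-level term (mean squared error between fused states). This is a generic distillation objective written under stated assumptions, not the paper's actual CSA loss: the function names, the temperature, the `state_weight` coefficient, and the choice of KL plus MSE are all hypothetical and do not come from the abstract.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anchoring_loss(teacher_logits, student_logits,
                   teacher_state, student_state,
                   temperature=2.0, state_weight=1.0):
    """Hypothetical anchoring objective (an assumption, not the paper's loss).

    The teacher is assumed to see every modality; the student is assumed
    to see an incomplete view of the same utterance.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    prediction_term = kl_divergence(p_teacher, p_student)
    state_term = mse(teacher_state, student_state)
    return prediction_term + state_weight * state_term


# Toy values: a 3-class emotion prediction and a 4-dimensional fused state.
loss = anchoring_loss(
    teacher_logits=[2.0, 0.0, -1.0],
    student_logits=[0.5, 1.0, -0.5],
    teacher_state=[0.2, -0.1, 0.4, 0.0],
    student_state=[0.1, 0.0, 0.3, 0.1],
)
print(loss)
```

In this sketch the loss is zero only when the student reproduces both the teacher's predictive distribution and its fused state, which is the sense in which the complete view acts as an anchor. A modality-specific state term could be added in the same way, one MSE term per modality.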

Comments: 25 pages, 5 figures

Cite

@article{arxiv.2605.29590,
  title  = {State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition},
  author = {Zhaoyan Pan and Xiangdong Li and Wenke Wu and Mengting Ma and Ye Lou and Ji Zhou and Jiatong Pan and Wei Zhang},
  journal= {arXiv preprint arXiv:2605.29590},
  year   = {2026}
}