Modality Dropout for Improved Performance-driven Talking Faces

Audio and Speech Processing 2020-05-29 v1 Machine Learning Sound Machine Learning

Abstract

We describe our novel deep learning approach for driving animated faces using both acoustic and visual information. In particular, speech-related facial movements are generated using audiovisual information, and non-speech facial movements are generated using only visual information. To ensure that our model exploits both modalities during training, batches are generated that contain audio-only, video-only, and audiovisual input features. The probability of dropping a modality allows control over the degree to which the model exploits audio and visual information during training. Our trained model runs in real time on resource-limited hardware (e.g., a smartphone), is user-agnostic, and does not depend on a potentially error-prone transcription of the speech. We use subjective testing to demonstrate: 1) the improvement of audiovisual-driven animation over the equivalent video-only approach, and 2) the improvement in the animation of speech-related facial movements after introducing modality dropout. Before introducing dropout, viewers prefer audiovisual-driven animation in 51% of the test sequences, compared with only 18% for video-driven animation. After introducing dropout, viewer preference for audiovisual-driven animation increases to 74%, while preference for video-only animation decreases to 8%.
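The batch construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "dropping" a modality means zeroing its features for a given example, and that each example is independently assigned to audio-only, video-only, or audiovisual input. The function name `modality_dropout` and the parameters `p_audio_only` and `p_video_only` are hypothetical, chosen only for illustration.

```python
import numpy as np


def modality_dropout(audio, video, p_audio_only, p_video_only, rng):
    """Randomly drop one modality per example in a training batch.

    Assumption (not from the paper): a dropped modality is replaced by zeros.
    audio, video: arrays whose first axis is the batch dimension.
    p_audio_only: probability that an example keeps audio and loses video.
    p_video_only: probability that an example keeps video and loses audio.
    The remaining probability mass keeps both modalities (audiovisual).
    """
    if p_audio_only < 0 or p_video_only < 0 or p_audio_only + p_video_only > 1:
        raise ValueError("probabilities must be non-negative and sum to at most 1")

    batch_size = audio.shape[0]
    # One uniform draw per example decides its input mode, so the two
    # masks are mutually exclusive and both modalities are never dropped.
    u = rng.random(batch_size)
    audio_only = u < p_audio_only
    video_only = (u >= p_audio_only) & (u < p_audio_only + p_video_only)

    audio_out = audio.copy()
    video_out = video.copy()
    video_out[audio_only] = 0.0  # audio-only examples: drop the video features
    audio_out[video_only] = 0.0  # video-only examples: drop the audio features
    return audio_out, video_out, audio_only, video_only


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = np.ones((8, 4))  # hypothetical acoustic feature vectors
    video = np.ones((8, 6))  # hypothetical visual feature vectors
    a, v, a_only, v_only = modality_dropout(audio, video, 0.3, 0.3, rng)
    print("audio-only examples:", int(a_only.sum()))
    print("video-only examples:", int(v_only.sum()))
```

Raising `p_audio_only` pushes the model to rely more on the acoustic input, and raising `p_video_only` does the same for the visual input, which is the kind of control over modality usage that the abstract attributes to the dropout probability.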

Cite

@article{arxiv.2005.13616,
  title  = {Modality Dropout for Improved Performance-driven Talking Faces},
  author = {Ahmed Hussen Abdelaziz and Barry-John Theobald and Paul Dixon and Reinhard Knothe and Nicholas Apostoloff and Sachin Kajareker},
  journal= {arXiv preprint arXiv:2005.13616},
  year   = {2020}
}

Comments

Pre-print
