Related papers: Puppet Dubbing

Dynamic Temporal Alignment of Speech to Lips

Many speech segments in movies are re-recorded in a studio during postproduction, to compensate for poor sound quality as recorded on location. Manual alignment of the newly-recorded speech with the original lip movements is a tedious task.…

Computer Vision and Pattern Recognition · Computer Science 2018-08-21 Tavi Halperin, Ariel Ephrat, Shmuel Peleg

Prosodic Phrase Alignment for Machine Dubbing

Dubbing is a type of audiovisual translation where dialogues are translated and enacted so that they give the impression that the media is in the target language. It requires a careful alignment of dubbed recordings with the lip movements…

Computation and Language · Computer Science 2019-08-21 Alp Öktem, Mireia Farrús, Antonio Bonafonte

InstructDubber: Instruction-based Alignment for Zero-shot Movie Dubbing

Movie dubbing seeks to synthesize speech from a given script using a specific voice, while ensuring accurate lip synchronization and emotion-prosody alignment with the character's visual performance. However, existing alignment approaches…

Sound · Computer Science 2025-12-22 Zhedong Zhang, Liang Li, Gaoxiang Cong, Chunshan Liu, Yuhan Gao, Xiaowan Wang, Tao Gu, Yuankai Qi

Prosodic Alignment for off-screen automatic dubbing

The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence. This entails isochrony, i.e., translating the original speech by also matching its prosodic structure into phrases and pauses,…

Computation and Language · Computer Science 2022-04-07 Yogesh Virkar, Marcello Federico, Robert Enyedi, Roberto Barra-Chicote

Neural Dubber: Dubbing for Videos According to Scripts

Dubbing is a post-production process of re-recording actors' dialogues, which is extensively used in filmmaking and video production. It is usually performed manually by professional voice actors who read lines with proper prosody, and in…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-16 Chenxu Hu, Qiao Tian, Tingle Li, Yuping Wang, Yuxuan Wang, Hang Zhao

Identity-Preserving Video Dubbing Using Motion Warping

Video dubbing aims to synthesize realistic, lip-synced videos from a reference video and a driving audio signal. Although existing methods can accurately generate mouth shapes driven by audio, they often fail to preserve identity-specific…

Computer Vision and Pattern Recognition · Computer Science 2025-01-10 Runzhen Liu, Qinjie Lin, Yunfei Liu, Lijian Lin, Ye Zhu, Yu Li, Chuhua Xian, Fa-Ting Hong

Dubbing for Everyone: Data-Efficient Visual Dubbing using Neural Rendering Priors

Visual dubbing is the process of generating lip motions of an actor in a video to synchronise with given audio. Recent advances have made progress towards this goal but have not been able to produce an approach suitable for mass adoption.…

Computer Vision and Pattern Recognition · Computer Science 2024-01-12 Jack Saunders, Vinay Namboodiri

Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on…

Sound · Computer Science 2024-06-11 Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi Kalayeh

Towards Realistic Visual Dubbing with Heterogeneous Sources

The task of few-shot visual dubbing focuses on synchronizing the lip movements with arbitrary speech input for any talking head video. Albeit moderate improvements in current approaches, they commonly require high-quality homologous data…

Computer Vision and Pattern Recognition · Computer Science 2022-01-19 Tianyi Xie, Liucheng Liao, Cheng Bi, Benlai Tang, Xiang Yin, Jianfei Yang, Mingjie Wang, Jiali Yao, Yang Zhang, Zejun Ma

Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing

Automatic dubbing (AD) is the task of translating the original speech in a video into target language speech. The new target language speech should satisfy isochrony; that is, the new speech should be time aligned with the original video,…

Computation and Language · Computer Science 2023-02-28 Alexandra Chronopoulou, Brian Thompson, Prashant Mathur, Yogesh Virkar, Surafel M. Lakew, Marcello Federico

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired…

Computer Vision and Pattern Recognition · Computer Science 2025-04-04 Kim Sung-Bin, Jeongsoo Choi, Puyuan Peng, Joon Son Chung, Tae-Hyun Oh, David Harwath

PS-TTS: Phonetic Synchronization in Text-to-Speech for Achieving Natural Automated Dubbing

Recently, artificial intelligence-based dubbing technology has advanced, enabling automated dubbing (AD) to convert the source speech of a video into target speech in different languages. However, natural AD still faces synchronization…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-05 Changi Hong, Yoonah Song, Hwayoung Park, Chaewoon Bang, Dayeon Ku, Do Hyun Lee, Hong Kook Kim

FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing…

Multimedia · Computer Science 2025-08-26 Gaoxiang Cong, Liang Li, Jiadong Pan, Zhedong Zhang, Amin Beheshti, Anton van den Hengel, Yuankai Qi, Qingming Huang

Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing

Movie dubbing describes the process of transforming a script into speech that aligns temporally and emotionally with a given movie clip while exemplifying the speaker's voice demonstrated in a short reference audio clip. This task demands…

Sound · Computer Science 2025-03-19 Zhedong Zhang, Liang Li, Chenggang Yan, Chunshan Liu, Anton van den Hengel, Yuankai Qi

Video Editing for Audio-Visual Dubbing

Visual dubbing, the synchronization of facial movements with new speech, is crucial for making content accessible across different languages, enabling broader global reach. However, current methods face significant limitations. Existing…

Computer Vision and Pattern Recognition · Computer Science 2025-05-30 Binyamin Manela, Sharon Gannot, Ethan Fetyaya

FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

Full-duplex dialog models aim to listen and speak simultaneously, delivering rapid responses to dynamic user input. Among different solutions to full-duplexity, a native solution merges multiple channels in each time step, achieving the…

Sound · Computer Science 2026-02-02 Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, Yequan Wang

Neural Voice Puppetry: Audio-driven Facial Reenactment

We present Neural Voice Puppetry, a novel approach for audio-driven facial video synthesis. Given an audio sequence of a source person or digital assistant, we generate a photo-realistic output video of a target person that is in sync with…

Computer Vision and Pattern Recognition · Computer Science 2020-07-30 Justus Thies, Mohamed Elgharib, Ayush Tewari, Christian Theobalt, Matthias Nießner

VideoDubber: Machine Translation with Speech-Aware Length Control for Video Dubbing

Video dubbing aims to translate the original speech in a film or television program into the speech in a target language, which can be achieved with a cascaded system consisting of speech recognition, machine translation and speech…

Computation and Language · Computer Science 2023-12-06 Yihan Wu, Junliang Guo, Xu Tan, Chen Zhang, Bohan Li, Ruihua Song, Lei He, Sheng Zhao, Arul Menezes, Jiang Bian

Neural Style-Preserving Visual Dubbing

Dubbing is a technique for translating video content from one language to another. However, state-of-the-art visual dubbing techniques directly copy facial expressions from source to target actors without considering identity-specific…

Computer Vision and Pattern Recognition · Computer Science 2019-09-09 Hyeongwoo Kim, Mohamed Elgharib, Michael Zollhöfer, Hans-Peter Seidel, Thabo Beeler, Christian Richardt, Christian Theobalt

Audio-Driven Dubbing for User Generated Contents via Style-Aware Semi-Parametric Synthesis

Existing automated dubbing methods are usually designed for Professionally Generated Content (PGC) production, which requires massive training data and training time to learn a person-specific audio-video mapping. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2023-09-04 Linsen Song, Wayne Wu, Chaoyou Fu, Chen Change Loy, Ran He