Related papers: Diff2Lip: Audio Conditioned Diffusion Models for L…

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their…

Computer Vision and Pattern Recognition · Computer Science 2022-07-01 Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro

A Lip Sync Expert Is All You Need for Speech to Lip Generation In The Wild

In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. Current works excel at producing accurate lip movements on a static image or videos of specific people…

Computer Vision and Pattern Recognition · Computer Science 2020-08-25 K R Prajwal, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar

HighSync: High-Quality Lip Synchronization via Latent Diffusion Models

We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile…

Computer Vision and Pattern Recognition · Computer Science 2026-05-19 Saeed Firouzi Daghigh, Majid Iranpour Mobarekeh, Mostafa Alavi, Mehdi Bagheri

LPIPS-AttnWav2Lip: Generic Audio-Driven lip synchronization for Talking Head Generation in the Wild

Researchers have shown a growing interest in Audio-driven Talking Head Generation. The primary challenge in talking head generation is achieving audio-visual coherence between the lips and the audio, known as lip synchronization. This paper…

Sound · Computer Science 2026-02-03 Zhipeng Chen, Xinheng Wang, Lun Xie, Haijie Yuan, Hang Pan

Style-Preserving Lip Sync via Audio-Aware Style Reference

Audio-driven lip sync has recently drawn significant attention due to its widespread application in the multimedia domain. Individuals exhibit distinct lip shapes when speaking the same utterance, attributed to the unique speaking styles of…

Computer Vision and Pattern Recognition · Computer Science 2025-06-19 Weizhi Zhong, Jichang Li, Yinqi Cai, Ming Li, Feng Gao, Liang Lin, Guanbin Li

OmniSync: Towards Universal Lip Synchronization via Diffusion Transformers

Lip synchronization is the task of aligning a speaker's lip movements in video with corresponding speech audio, and it is essential for creating realistic, expressive video content. However, existing methods often rely on reference frames…

Computer Vision and Pattern Recognition · Computer Science 2025-09-19 Ziqiao Peng, Jiwen Liu, Haoxian Zhang, Xiaoqiang Liu, Songlin Tang, Pengfei Wan, Di Zhang, Hongyan Liu, Jun He

Data standardization for robust lip sync

Lip sync is a fundamental audio-visual task. However, existing lip sync methods fall short of being robust in the wild. One important cause could be distracting factors on the visual input side, making extracting lip motion information…

Computer Vision and Pattern Recognition · Computer Science 2024-09-10 Chun Wang

SyncDiff: Diffusion-based Talking Head Synthesis with Bottlenecked Temporal Visual Prior for Improved Synchronization

Talking head synthesis, also known as speech-to-lip synthesis, reconstructs the facial motions that align with the given audio tracks. The synthesized videos are evaluated on mainly two aspects, lip-speech synchronization and image…

Machine Learning · Computer Science 2025-03-18 Xulin Fan, Heting Gao, Ziyi Chen, Peng Chang, Mei Han, Mark Hasegawa-Johnson

Lips Are Lying: Spotting the Temporal Inconsistency between Audio and Visual in Lip-Syncing DeepFakes

In recent years, DeepFake technology has achieved unprecedented success in high-quality video synthesis, but these methods also pose potential and severe security threats to humanity. DeepFake can be bifurcated into entertainment…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Weifeng Liu, Tianyi She, Jiawei Liu, Boheng Li, Dongyu Yao, Ziyou Liang, Run Wang

SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory

The challenge of talking face generation from speech lies in aligning two different modal information, audio and video, such that the mouth region corresponds to input audio. Previous methods either exploit audio-visual representation…

Computer Vision and Pattern Recognition · Computer Science 2022-11-04 Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, Yong Man Ro

UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Ruidi Fan, Yang Zhou, Siyuan Wang, Tian Yu, Yutong Jiang, Xusheng Liu

Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality, using given audio and reference video while preserving identity and visual characteristics. In this paper, we start by…

Computer Vision and Pattern Recognition · Computer Science 2024-07-19 Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel

Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering

Generating semantically coherent and visually accurate talking faces requires bridging the gap between linguistic meaning and facial articulation. Although audio-driven methods remain prevalent, their reliance on high-quality paired audio…

Computer Vision and Pattern Recognition · Computer Science 2025-08-05 Xu Wang, Shengeng Tang, Fei Wang, Lechao Cheng, Dan Guo, Feng Xue, Richang Hong

Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation

Audio-Visual Speech-to-Speech Translation typically prioritizes improving translation quality and naturalness. However, an equally critical aspect in audio-visual content is lip-synchrony, ensuring that the movements of the lips match the…

Sound · Computer Science 2024-12-24 Lucas Goncalves, Prashant Mathur, Xing Niu, Brady Houston, Chandrashekhar Lavania, Srikanth Vishnubhotla, Lijia Sun, Anthony Ferritto

KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution

Lip synchronization, known as the task of aligning lip movements in an existing video with new input audio, is typically framed as a simpler variant of audio-driven facial animation. However, as well as suffering from the usual issues in…

Computer Vision and Pattern Recognition · Computer Science 2025-05-02 Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic

SyncAnyone: Implicit Disentanglement via Progressive Self-Correction for Lip-Syncing in the wild

High-quality AI-powered video dubbing demands precise audio-lip synchronization, high-fidelity visual generation, and faithful preservation of identity and background. Most existing methods rely on a mask-based training strategy, where the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-09 Xindi Zhang, Dechao Meng, Steven Xiao, Qi Wang, Peng Zhang, Bang Zhang

LatentSync: Taming Audio-Conditioned Latent Diffusion Models for Lip Sync with SyncNet Supervision

End-to-end audio-conditioned latent diffusion models (LDMs) have been widely adopted for audio-driven portrait animation, demonstrating their effectiveness in generating lifelike and high-resolution talking videos. However, direct…

Computer Vision and Pattern Recognition · Computer Science 2025-03-14 Chunyu Li, Chao Zhang, Weikai Xu, Jingyu Lin, Jinghui Xie, Weiguo Feng, Bingyue Peng, Cunjian Chen, Weiwei Xing

SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion

Recent advances in diffusion models have led to significant progress in audio-driven lip synchronization. However, existing methods typically rely on constrained audio-visual alignment priors or multi-stage learning of intermediate…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Junxian Ma, Shiwen Wang, Jian Yang, Junyi Hu, Jian Liang, Guosheng Lin, Jingbo Chen, Kai Li, Yu Meng

Attention-Based Lip Audio-Visual Synthesis for Talking Face Generation in the Wild

Talking face generation with great practical significance has attracted more attention in recent audio-visual studies. How to achieve accurate lip synchronization is a long-standing challenge to be further investigated. Motivated by xxx, in…

Computer Vision and Pattern Recognition · Computer Science 2022-04-12 Ganglai Wang, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha

Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement

We aim to edit the lip movements in talking video according to the given speech while preserving the personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2)…

Computer Vision and Pattern Recognition · Computer Science 2024-06-18 Runyi Yu, Tianyu He, Ailing Zhang, Yuchi Wang, Junliang Guo, Xu Tan, Chang Liu, Jie Chen, Jiang Bian