Related papers: Synchformer: Efficient Synchronization from Sparse…

Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors

The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

Audio-Visual Synchronisation in the wild

In this paper, we consider the problem of audio-visual synchronisation applied to videos 'in-the-wild' (i.e. of general classes beyond speech). As a new task, we identify and curate a test set with high audio-visual correlation, namely…

Computer Vision and Pattern Recognition · Computer Science 2021-12-09 Honglie Chen, Weidi Xie, Triantafyllos Afouras, Arsha Nagrani, Andrea Vedaldi, Andrew Zisserman

Syncphony: Synchronized Audio-to-Video Generation with Diffusion Transformers

Text-to-video and image-to-video generation have made rapid progress in visual quality, but they remain limited in controlling the precise timing of motion. In contrast, audio provides temporal cues aligned with video motion, making it a…

Computer Vision and Pattern Recognition · Computer Science 2025-09-29 Jibin Song, Mingi Kwon, Jaeseok Jeong, Youngjung Uh

Scaling Up Audio-Synchronized Visual Animation: An Efficient Training Paradigm

Recent advances in audio-synchronized visual animation enable control of video content using audios from specific classes. However, existing methods rely heavily on expensive manual curation of high-quality, class-specific training videos,…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Lin Zhang, Zefan Cai, Yufan Zhou, Shentong Mo, Jinhong Lin, Cheng-En Wu, Yibing Wei, Yijing Zhang, Ruiyi Zhang, Wen Xiao, Tong Sun, Junjie Hu, Pedro Morgado

Self-Supervised Visual Acoustic Matching

Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment. Existing methods assume access to paired training data, where the audio is observed in both source and target…

Multimedia · Computer Science 2023-11-27 Arjun Somayazulu, Changan Chen, Kristen Grauman

Visual Acoustic Matching

We introduce the visual acoustic matching task, in which an audio clip is transformed to sound like it was recorded in a target environment. Given an image of the target environment and a waveform for the source audio, the goal is to…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Changan Chen, Ruohan Gao, Paul Calamia, Kristen Grauman

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal…

Computer Vision and Pattern Recognition · Computer Science 2018-11-13 Bruno Korbar, Du Tran, Lorenzo Torresani

Improving On-Screen Sound Separation for Open-Domain Videos with Audio-Visual Self-Attention

We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous…

Sound · Computer Science 2021-10-15 Efthymios Tzinis, Scott Wisdom, Tal Remez, John R. Hershey

Generating Realistic Images from In-the-wild Sounds

Representing wild sounds as images is an important but challenging task due to the lack of paired datasets between sound and images and the significant differences in the characteristics of these two modalities. Previous studies have…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Taegyeong Lee, Jeonghun Kang, Hyeonyu Kim, Taehwan Kim

Audio-Visual Feature Synchronization for Robust Speech Enhancement in Hearing Aids

Audio-visual feature synchronization for real-time speech enhancement in hearing aids represents a progressive approach to improving speech intelligibility and user experience, particularly in strong noisy backgrounds. This approach…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-28 Nasir Saleem, Mandar Gogate, Kia Dashtipour, Adeel Hussain, Usman Anwar, Adewale Adetomi, Tughrul Arslan, Amir Hussain

ASIC: Aligning Sparse in-the-wild Image Collections

We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However,…

Computer Vision and Pattern Recognition · Computer Science 2023-03-29 Kamal Gupta, Varun Jampani, Carlos Esteves, Abhinav Shrivastava, Ameesh Makadia, Noah Snavely, Abhishek Kar

A Comprehensive Review and Taxonomy of Audio-Visual Synchronization Techniques for Realistic Speech Animation

In many applications, synchronizing audio with visuals is crucial, such as in creating graphic animations for films or games, translating movie audio into different languages, and developing metaverse applications. This review explores…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-29 Jose Geraldo Fernandes, Sinval Nascimento, Daniel Dominguete, André Oliveira, Lucas Rotsen, Gabriel Souza, David Brochero, Luiz Facury, Mateus Vilela, Hebert Costa, Frederico Coelho, Antônio P. Braga

A Convolutional-Attentional Neural Framework for Structure-Aware Performance-Score Synchronization

Performance-score synchronization is an integral task in signal processing, which entails generating an accurate mapping between an audio recording of a performance and the corresponding musical score. Traditional synchronization methods…

Sound · Computer Science 2022-04-20 Ruchit Agrawal, Daniel Wolff, Simon Dixon

Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds

Recent progress in deep learning has enabled many advances in sound separation and visual scene understanding. However, extracting sound sources which are apparent in natural videos remains an open problem. In this work, we present…

Sound · Computer Science 2021-06-01 Efthymios Tzinis, Scott Wisdom, Aren Jansen, Shawn Hershey, Tal Remez, Daniel P. W. Ellis, John R. Hershey

A Synchronized Audio-Visual Multi-View Capture System

Multi-view capture systems have been an important tool in research for recording human motion under controlling conditions. Most existing systems are specified around video streams and provide little or no support for audio acquisition and…

Computer Vision and Pattern Recognition · Computer Science 2026-04-23 Xiangwei Shi, Gara Dorta, Ruud de Jong, Ojas Shirekar, Chirag Raman

Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi

Self-Supervised Audio-Visual Soundscape Stylization

Speech sounds convey a great deal of information about the scenes, resulting in a variety of effects ranging from reverberation to additional ambient sounds. In this paper, we manipulate input speech to sound as though it was recorded…

Computer Vision and Pattern Recognition · Computer Science 2024-09-24 Tingle Li, Renhao Wang, Po-Yao Huang, Andrew Owens, Gopala Anumanchipalli

Beyond Audio and Pose: A General-Purpose Framework for Video Synchronization

Video synchronization, aligning multiple video streams capturing the same event from different angles, is crucial for applications such as reality TV show production, sports analysis, surveillance, and autonomous systems. Prior work has…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Yosub Shin, Igor Molybog

PEAVS: Perceptual Evaluation of Audio-Visual Synchrony Grounded in Viewers' Opinion Scores

Recent advancements in audio-visual generative modeling have been propelled by progress in deep learning and the availability of data-rich benchmarks. However, the growth is not attributed solely to models and benchmarks. Universally…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Lucas Goncalves, Prashant Mathur, Chandrashekhar Lavania, Metehan Cekic, Marcello Federico, Kyu J. Han

Visual-Aware Speech Recognition for Noisy Scenarios

Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech…

Computation and Language · Computer Science 2025-04-11 Lakshmipathi Balaji, Karan Singla