Synchformer: Efficient Synchronization from Sparse Cues

Computer Vision and Pattern Recognition | 2024-01-30 | v1 | Machine Learning, Multimedia, Sound, Audio and Speech Processing

Abstract

Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.

Cite

@article{arxiv.2401.16423,
  title  = {Synchformer: Efficient Synchronization from Sparse Cues},
  author = {Vladimir Iashin and Weidi Xie and Esa Rahtu and Andrew Zisserman},
  journal= {arXiv preprint arXiv:2401.16423},
  year   = {2024}
}

Comments

Extended version of the ICASSP 24 paper. Project page: https://www.robots.ox.ac.uk/~vgg/research/synchformer/ Code: https://github.com/v-iashin/Synchformer
