Related papers: Learning Representations from Audio-Visual Spatial…

Self-supervised Learning of Audio Representations from Audio-Visual Data using Spatial Alignment

Learning from audio-visual data offers many possibilities to express correspondence between the audio and visual content, similar to the human perception that relates aural and visual information. In this work, we present a method for…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-11-23 · Shanshan Wang, Archontis Politis, Annamaria Mesaros, Tuomas Virtanen

Telling Left from Right: Learning Spatial Correspondence of Sight and Sound

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the…

Computer Vision and Pattern Recognition · Computer Science · 2020-06-15 · Karren Yang, Bryan Russell, Justin Salamon

Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos

We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-07 · Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

Learning Self-Supervised Audio-Visual Representations for Sound Recommendations

We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-11 · Sudha Krishnamurthy

Self-supervised Video Representation Learning by Context and Motion Decoupling

A key challenge in self-supervised video representation learning is how to effectively capture motion information besides context bias. While most existing works implicitly achieve this with video-specific pretext tasks (e.g., predicting…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-05 · Lianghua Huang, Yu Liu, Bin Wang, Pan Pan, Yinghui Xu, Rong Jin

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., the voice of lip motion, the music of playing instruments. There is an underlying correlation between audio and visual events, which can be…

Multimedia · Computer Science · 2020-08-19 · Ying Cheng, Ruize Wang, Zhihao Pan, Rui Feng, Yuejie Zhang

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial…

Computer Vision and Pattern Recognition · Computer Science · 2021-02-01 · Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, Yun-hui Liu

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a…

Computer Vision and Pattern Recognition · Computer Science · 2019-04-09 · Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400)…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-14 · Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

There is a natural correlation between the visual and auditive elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal…

Computer Vision and Pattern Recognition · Computer Science · 2018-11-13 · Bruno Korbar, Du Tran, Lorenzo Torresani

Rethinking Self-supervised Correspondence Learning: A Video Frame-level Similarity Perspective

Learning a good representation for space-time correspondence is the key for various computer vision tasks, including tracking object bounding boxes and performing video object pixel segmentation. To learn generalizable representation for…

Computer Vision and Pattern Recognition · Computer Science · 2021-10-15 · Jiarui Xu, Xiaolong Wang

Contrastive Spatio-Temporal Pretext Learning for Self-supervised Video Representation

Spatio-temporal representation learning is critical for video self-supervised representation. Recent approaches mainly use contrastive learning and pretext tasks. However, these approaches learn representation by discriminating sampled…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-21 · Yujia Zhang, Lai-Man Po, Xuyuan Xu, Mengyang Liu, Yexin Wang, Weifeng Ou, Yuzhi Zhao, Wing-Yin Yu

Learning to Perceive "Where": Spatial Pretext Tasks for Robust Self-Supervised Learning

Existing self-supervised learning (SSL) methods primarily learn object-invariant representations but often neglect the spatial structure and relationships among object parts. To address this limitation, we introduce Spatial Prediction (SP),…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-12 · Yang Shen, Yusen Cai, Weronika Hryniewska-Guzik, Qing Lin, Mengmi Zhang

Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

We present CrissCross, a self-supervised framework for learning audio-visual representations. A novel notion is introduced in our framework whereby in addition to learning the intra-modal and standard 'synchronous' cross-modal relations,…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-28 · Pritam Sarkar, Ali Etemad

Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision

Modern self-supervised learning algorithms typically enforce persistency of instance representations across views. While being very effective on learning holistic image and video representations, such an objective becomes sub-optimal for…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-05 · Liangzhe Yuan, Rui Qian, Yin Cui, Boqing Gong, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu

Self-supervised Video Representation Learning by Pace Prediction

This paper addresses the problem of self-supervised video representation learning from a new perspective -- by video pace prediction. It stems from the observation that the human visual system is sensitive to video pace, e.g., slow motion, a…

Computer Vision and Pattern Recognition · Computer Science · 2020-09-07 · Jiangliu Wang, Jianbo Jiao, Yun-Hui Liu

Learning to Answer Questions in Dynamic Audio-Visual Scenarios

In this paper, we focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos. The problem requires comprehensive multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-06 · Guangyao Li, Yake Wei, Yapeng Tian, Chenliang Xu, Ji-Rong Wen, Di Hu

Robust Audio-Visual Instance Discrimination

We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is…

Computer Vision and Pattern Recognition · Computer Science · 2021-03-31 · Pedro Morgado, Ishan Misra, Nuno Vasconcelos

ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency

We study self-supervised video representation learning, which is a challenging task due to 1) lack of labels for explicit supervision; 2) unstructured and noisy visual information. Existing methods mainly use contrastive loss with video…

Computer Vision and Pattern Recognition · Computer Science · 2021-08-18 · Deng Huang, Wenhao Wu, Weiwen Hu, Xu Liu, Dongliang He, Zhihua Wu, Xiangmiao Wu, Mingkui Tan, Errui Ding

Audio-Visual Contrastive Learning with Temporal Self-Supervision

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-16 · Simon Jenni, Alexander Black, John Collomosse