Related papers: Sequential Contrastive Audio-Visual Learning

200 papers

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning…

Machine Learning · Computer Science 2024-06-21 Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also…

Computer Vision and Pattern Recognition · Computer Science 2023-02-16 Simon Jenni, Alexander Black, John Collomosse

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower…

Machine Learning · Computer Science 2021-04-20 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-02 Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

The current research focus on Content-Based Video Retrieval requires higher-level video representation describing the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames…

Computer Vision and Pattern Recognition · Computer Science 2020-10-01 Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

Recently, self-supervised representation learning gives further development in multimedia technology. Most existing self-supervised learning methods are applicable to packaged data. However, when it comes to streamed data, they are…

Computer Vision and Pattern Recognition · Computer Science 2022-11-03 Zhiwei Lin, Yongtao Wang, Hongxiang Lin

We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video to image aggregation strategy that spatially arranges multiple frames from each video…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Shaif Chowdhury, Mushfika Rahman, Greg Hamerly

The challenges in applying contrastive learning to speaker verification (SV) are that the softmax-based contrastive loss lacks discriminative power and that the hard negative pairs can easily influence learning. To overcome the first…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-14 Zhe Li, Man-Wai Mak, Helen Mei-Ling Meng

Contrastive learning is commonly used as a method of self-supervised learning with the "anchor" and "positive" being two random augmentations of a given input image, and the "negative" is the set of all other images. However, the…

Computer Vision and Pattern Recognition · Computer Science 2022-08-16 Rishab Balasubramanian, Kunal Rathore

We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that…

Sound · Computer Science 2021-04-29 Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord

Aiming at exploiting the rich information in user behaviour sequences, sequential recommendation has been widely adopted in real-world recommender systems. However, current methods suffer from the following issues: 1) sparsity of user-item…

Information Retrieval · Computer Science 2022-12-06 Yu Wang, Hengrui Zhang, Zhiwei Liu, Liangwei Yang, Philip S. Yu

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize video segments containing an AVE and identify its category. It is pivotal to learn the discriminative…

Computer Vision and Pattern Recognition · Computer Science 2022-11-21 Jinxing Zhou, Dan Guo, Meng Wang

Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the help of contrastive learning, which assumes only the audio and…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive…

The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Yang Liu, Ying Tan, Haoyuan Lan

Contrastive learning of auditory and visual perception has been extremely successful when investigated individually. However, there are still major questions on how we could integrate principles learned from both domains to attain effective…

Computer Vision and Pattern Recognition · Computer Science 2021-10-15 Haider Al-Tahan, Yalda Mohsenzadeh

A steady momentum of innovations and breakthroughs has convincingly pushed the limits of unsupervised image representation learning. Compared to static 2D images, video has one more dimension (time). The inherent supervision existing in…

Computer Vision and Pattern Recognition · Computer Science 2021-01-28 Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei

Contrastive learning has revolutionized self-supervised image representation learning field, and recently been adapted to video domain. One of the greatest advantages of contrastive learning is that it allows us to flexibly define powerful…

Computer Vision and Pattern Recognition · Computer Science 2021-08-06 Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Sören Schwertfeger, Cyrill Stachniss, Mu Li

This paper proposes a single-stage training approach that semantically aligns three modalities - audio, visual, and text using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing…

Sound · Computer Science 2025-05-21 Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

The rise of video-sharing platforms has attracted more and more people to shoot videos and upload them to the Internet. These videos mostly contain a carefully-edited background audio track, where serious speech change, pitch shifting and…

Sound · Computer Science 2020-10-27 Zhesong Yu, Xingjian Du, Bilei Zhu, Zejun Ma