Related papers: Sequential Contrastive Audio-Visual Learning

EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning

Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning…

Machine Learning · Computer Science 2024-06-21 Jongsuk Kim, Hyeongkeun Lee, Kyeongha Rho, Junmo Kim, Joon Son Chung

Audio-Visual Contrastive Learning with Temporal Self-Supervision

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also…

Computer Vision and Pattern Recognition · Computer Science 2023-02-16 Simon Jenni, Alexander Black, John Collomosse

Active Contrastive Learning of Audio-Visual Video Representations

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower…

Machine Learning · Computer Science 2021-04-20 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Contrastive Separative Coding for Self-supervised Representation Learning

To extract robust deep representations from long sequential modeling of speech data, we propose a self-supervised learning approach, namely Contrastive Separative Coding (CSC). Our key finding is to learn such representations by separating…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-02 Jun Wang, Max W. Y. Lam, Dan Su, Dong Yu

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

The current research focus on Content-Based Video Retrieval requires higher-level video representation describing the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames…

Computer Vision and Pattern Recognition · Computer Science 2020-10-01 Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

Continual Contrastive Learning for Image Classification

Recently, self-supervised representation learning gives further development in multimedia technology. Most existing self-supervised learning methods are applicable to packaged data. However, when it comes to streamed data, they are…

Computer Vision and Pattern Recognition · Computer Science 2022-11-03 Zhiwei Lin, Yongtao Wang, Hongxiang Lin

Supervised Contrastive Frame Aggregation for Video Representation Learning

We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video to image aggregation strategy that spatially arranges multiple frames from each video…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Shaif Chowdhury, Mushfika Rahman, Greg Hamerly

Discriminative Speaker Representation via Contrastive Learning with Class-Aware Attention in Angular Space

The challenges in applying contrastive learning to speaker verification (SV) are that the softmax-based contrastive loss lacks discriminative power and that the hard negative pairs can easily influence learning. To overcome the first…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-14 Zhe Li, Man-Wai Mak, Helen Mei-Ling Meng

Contrastive Learning for Object Detection

Contrastive learning is commonly used as a method of self-supervised learning with the "anchor" and "positive" being two random augmentations of a given input image, and the "negative" is the set of all other images. However, the…

Computer Vision and Pattern Recognition · Computer Science 2022-08-16 Rishab Balasubramanian, Kunal Rathore

Multimodal Self-Supervised Learning of General Audio Representations

We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that…

Sound · Computer Science 2021-04-29 Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord

ContrastVAE: Contrastive Variational AutoEncoder for Sequential Recommendation

Aiming at exploiting the rich information in user behaviour sequences, sequential recommendation has been widely adopted in real-world recommender systems. However, current methods suffer from the following issues: 1) sparsity of user-item…

Information Retrieval · Computer Science 2022-12-06 Yu Wang, Hengrui Zhang, Zhiwei Liu, Liangwei Yang, Philip S. Yu

Contrastive Positive Sample Propagation along the Audio-Visual Event Line

Visual and audio signals often coexist in natural environments, forming audio-visual events (AVEs). Given a video, we aim to localize video segments containing an AVE and identify its category. It is pivotal to learn the discriminative…

Computer Vision and Pattern Recognition · Computer Science 2022-11-21 Jinxing Zhou, Dan Guo, Meng Wang

Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning

Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the help of contrastive learning, which assumes only the audio and…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Weixuan Sun, Jiayi Zhang, Jianyuan Wang, Zheyuan Liu, Yiran Zhong, Tianpeng Feng, Yandong Guo, Yanhao Zhang, Nick Barnes

Contrastive Audio-Visual Masked Autoencoder

In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE) by combining contrastive…

Multimedia · Computer Science 2023-04-13 Yuan Gong, Andrew Rouditchenko, Alexander H. Liu, David Harwath, Leonid Karlinsky, Hilde Kuehne, James Glass

Self-supervised Contrastive Learning for Audio-Visual Action Recognition

The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Yang Liu, Ying Tan, Haoyuan Lan

The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning

Contrastive learning of auditory and visual perception has been extremely successful when investigated individually. However, there are still major questions on how we could integrate principles learned from both domains to attain effective…

Computer Vision and Pattern Recognition · Computer Science 2021-10-15 Haider Al-Tahan, Yalda Mohsenzadeh

SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning

A steady momentum of innovations and breakthroughs has convincingly pushed the limits of unsupervised image representation learning. Compared to static 2D images, video has one more dimension (time). The inherent supervision existing in…

Computer Vision and Pattern Recognition · Computer Science 2021-01-28 Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei

Video Contrastive Learning with Global Context

Contrastive learning has revolutionized self-supervised image representation learning field, and recently been adapted to video domain. One of the greatest advantages of contrastive learning is that it allows us to flexibly define powerful…

Computer Vision and Pattern Recognition · Computer Science 2021-08-06 Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Sören Schwertfeger, Cyrill Stachniss, Mu Li

Representation Learning for Semantic Alignment of Language, Audio, and Visual Modalities

This paper proposes a single-stage training approach that semantically aligns three modalities - audio, visual, and text using a contrastive learning framework. Contrastive training has gained prominence for multimodal alignment, utilizing…

Sound · Computer Science 2025-05-21 Parthasaarathy Sudarsanam, Irene Martín-Morató, Tuomas Virtanen

Contrastive Unsupervised Learning for Audio Fingerprinting

The rise of video-sharing platforms has attracted more and more people to shoot videos and upload them to the Internet. These videos mostly contain a carefully-edited background audio track, where serious speech change, pitch shifting and…

Sound · Computer Science 2020-10-27 Zhesong Yu, Xingjian Du, Bilei Zhu, Zejun Ma