Related papers: Controllable Augmentations for Video Representatio…

200 papers

Self-supervised audio-visual learning aims to capture useful representations of video by leveraging correspondences between visual and audio inputs. Existing approaches have focused primarily on matching semantic information between the…

Computer Vision and Pattern Recognition · Computer Science 2020-06-15 Karren Yang, Bryan Russell, Justin Salamon

In recent years, creative content generations like style transfer and neural photo editing have attracted more and more attention. Among these, cartoonization of real-world scenes has promising applications in entertainment and industry.…

Computer Vision and Pattern Recognition · Computer Science 2022-04-05 Zhenhuan Liu, Liang Li, Huajie Jiang, Xin Jin, Dandan Tu, Shuhui Wang, Zheng-Jun Zha

We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos. Our method uses a masked auto-encoding framework to synthesize masked binaural (multi-channel) audio…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Sagnik Majumder, Ziad Al-Halah, Kristen Grauman

Video moment retrieval is the task of retrieving specific segments of a video corresponding to a given text query. Recent studies have been conducted to improve multimodal alignment performance through visual-linguistic similarity learning…

Computer Vision and Pattern Recognition · Computer Science 2026-05-01 Ji-Hyeon Kim, Ho-Joong Kim, Seong-Whan Lee

The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a…

Computer Vision and Pattern Recognition · Computer Science 2023-07-21 Sara Sarto, Manuele Barraco, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Video retrieval is becoming increasingly important owing to the rapid emergence of videos on the Internet. The dominant paradigm for video retrieval learns video-text representations by pushing the distance between the similarity of…

Computer Vision and Pattern Recognition · Computer Science 2023-03-10 Feng He, Qi Wang, Zhifan Feng, Wenbin Jiang, Yajuan Lv, Yong Zhu, Xiao Tan

The past decade has witnessed notable achievements in self-supervised learning for video tasks. Recent efforts typically adopt the Masked Video Modeling (MVM) paradigm, leading to significant progress on multiple video tasks. However, two…

Computer Vision and Pattern Recognition · Computer Science 2025-03-20 Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai, Qingming Huang

Recent multimodal models such as Contrastive Language-Image Pre-training (CLIP) have shown remarkable ability to align visual and linguistic representations. However, domains where small visual differences carry large semantic significance,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Hiroshi Sasaki

Contrastive vision-language models, such as CLIP, have garnered considerable attention for various downstream tasks, mainly due to the remarkable ability of the learned features for generalization. However, the features they learned often…

Computer Vision and Pattern Recognition · Computer Science 2025-04-24 Yichao Cai, Yuhang Liu, Zhen Zhang, Javen Qinfeng Shi

We study self-supervised video representation learning, which is a challenging task due to 1) lack of labels for explicit supervision; 2) unstructured and noisy visual information. Existing methods mainly use contrastive loss with video…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Deng Huang, Wenhao Wu, Weiwen Hu, Xu Liu, Dongliang He, Zhihua Wu, Xiangmiao Wu, Mingkui Tan, Errui Ding

This thesis focuses on video understanding for human action and interaction recognition. We start by identifying the main challenges related to action recognition from videos and review how they have been addressed by current methods. Based…

Computer Vision and Pattern Recognition · Computer Science 2021-10-06 Alexandros Stergiou

Unsupervised video-based object-centric learning is a promising avenue to learn structured representations from large, unlabeled video collections, but previous approaches have only managed to scale to real-world datasets in restricted…

Computer Vision and Pattern Recognition · Computer Science 2024-03-18 Andrii Zadaianchuk, Maximilian Seitzer, Georg Martius

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank.…

Machine Learning · Computer Science 2020-07-02 Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

Self-supervised contrastive learning (CL) has achieved state-of-the-art performance in representation learning by minimizing the distance between positive pairs while maximizing that of negative ones. Recently, it has been verified that the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Jin-Young Kim, Soonwoo Kwon, Hyojun Go, Yunsung Lee, Seungtaek Choi, Hyun-Gyoon Kim

Visual contrastive learning aims to learn representations by contrasting similar (positive) and dissimilar (negative) pairs of data samples. The design of these pairs significantly impacts representation quality, training efficiency, and…

Computer Vision and Pattern Recognition · Computer Science 2025-02-13 Shasvat Desai, Debasmita Ghose, Deep Chakraborty

We introduce a self-supervised representation learning method based on the task of temporal alignment between videos. The method trains a network using temporal cycle consistency (TCC), a differentiable cycle-consistency loss that can be…

Computer Vision and Pattern Recognition · Computer Science 2019-04-17 Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Pierre Sermanet, Andrew Zisserman

In this work we address the challenging problem of unsupervised learning from videos. Existing methods utilize the spatio-temporal continuity in contiguous video frames as regularization for the learning process. Typically, this temporal…

Computer Vision and Pattern Recognition · Computer Science 2018-10-12 Carolina Redondo-Cabrera, Roberto J. López-Sastre

We present an approach to learn voice-face representations from the talking face videos, without any identity labels. Previous works employ cross-modal instance discrimination tasks to establish the correlation of voice and face. These…

Sound · Computer Science 2022-05-30 Boqing Zhu, Kele Xu, Changjian Wang, Zheng Qin, Tao Sun, Huaimin Wang, Yuxing Peng

MoCo is effective for unsupervised image representation learning. In this paper, we propose VideoMoCo for unsupervised video representation learning. Given a video sequence as an input sample, we improve the temporal feature representations…

Computer Vision and Pattern Recognition · Computer Science 2021-03-18 Tian Pan, Yibing Song, Tianyu Yang, Wenhao Jiang, Wei Liu

In the field of visual representation learning, performance of contrastive learning has been catching up with the supervised method which is commonly a classification convolutional neural network. However, most of the research work focuses…

Computer Vision and Pattern Recognition · Computer Science 2023-01-31 Xiaoqi Zhuang