Related papers: Controllable Augmentations for Video Representatio…

Audio-Visual Instance Discrimination with Cross-Modal Agreement

We present a self-supervised learning approach to learn audio-visual representations from video and audio. Our method uses contrastive learning for cross-modal discrimination of video from audio and vice-versa. We show that optimizing for…

Computer Vision and Pattern Recognition · Computer Science 2021-03-31 Pedro Morgado, Nuno Vasconcelos, Ishan Misra

CoCon: Cooperative-Contrastive Learning

Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests contrastive learning is a promising…

Computer Vision and Pattern Recognition · Computer Science 2021-05-03 Nishant Rai, Ehsan Adeli, Kuan-Hui Lee, Adrien Gaidon, Juan Carlos Niebles

Learning Object-Centric Video Models by Contrasting Sets

Contrastive, self-supervised learning of object representations recently emerged as an attractive alternative to reconstruction-based training. Prior approaches focus on contrasting individual object representations (slots) against one…

Computer Vision and Pattern Recognition · Computer Science 2020-11-23 Sindy Löwe, Klaus Greff, Rico Jonschkowski, Alexey Dosovitskiy, Thomas Kipf

Self-supervised pre-training and contrastive representation learning for multiple-choice video QA

Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions. In this paper, we propose novel training schemes for multiple-choice video question answering with…

Computation and Language · Computer Science 2020-12-15 Seonhoon Kim, Seohyeong Jeong, Eunbyul Kim, Inho Kang, Nojun Kwak

Video Representation Learning with Visual Tempo Consistency

Visual tempo, which describes how fast an action goes, has shown its potential in supervised action recognition. In this work, we demonstrate that visual tempo can also serve as a self-supervision signal for video representation learning.…

Computer Vision and Pattern Recognition · Computer Science 2020-12-21 Ceyuan Yang, Yinghao Xu, Bo Dai, Bolei Zhou

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

Generalist Vision-Language-Action models are currently hindered by the scarcity of robotic data compared to the abundance of human video demonstrations. Existing Latent Action Models attempt to leverage video data but often suffer from…

Robotics · Computer Science 2026-01-08 Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

Implicit Temporal Modeling with Learnable Alignment for Video Recognition

Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in various image tasks. However, how to extend CLIP with effective temporal modeling is still an open and crucial problem. Existing factorized or joint…

Computer Vision and Pattern Recognition · Computer Science 2023-08-16 Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang

An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement

Video enhancement is a challenging problem, more than that of stills, mainly due to high computational cost, larger data volumes and the difficulty of achieving consistency in the spatio-temporal domain. In practice, these challenges are…

Image and Video Processing · Electrical Eng. & Systems 2022-12-13 Dario Fuoli, Zhiwu Huang, Danda Pani Paudel, Luc Van Gool, Radu Timofte

Multi-scale 2D Representation Learning for weakly-supervised moment retrieval

Video moment retrieval aims to search the moment most relevant to a given language query. However, most existing methods in this community often require temporal boundary annotations which are expensive and time-consuming to label. Hence…

Computer Vision and Pattern Recognition · Computer Science 2021-11-05 Ding Li, Rui Wu, Yongqiang Tang, Zhizhong Zhang, Wensheng Zhang

What Should Not Be Contrastive in Contrastive Learning

Recent self-supervised contrastive methods have been able to produce impressive transferable visual representations by learning to be invariant to different data augmentations. However, these methods implicitly assume a particular set of…

Computer Vision and Pattern Recognition · Computer Science 2021-03-22 Tete Xiao, Xiaolong Wang, Alexei A. Efros, Trevor Darrell

Feature Augmentation for Self-supervised Contrastive Learning: A Closer Look

Self-supervised contrastive learning heavily relies on the view variance brought by data augmentation, so that it can learn a view-invariant pre-trained representation. Beyond increasing the view variance for contrast, this work focuses on…

Computer Vision and Pattern Recognition · Computer Science 2024-10-17 Yong Zhang, Rui Zhu, Shifeng Zhang, Xu Zhou, Shifeng Chen, Xiaofan Chen

Probabilistic Representations for Video Contrastive Learning

This paper presents Probabilistic Video Contrastive Learning, a self-supervised representation learning method that bridges contrastive learning with probabilistic representation. We hypothesize that the clips composing the video have…

Computer Vision and Pattern Recognition · Computer Science 2022-04-11 Jungin Park, Jiyoung Lee, Ig-Jae Kim, Kwanghoon Sohn

Video Playback Rate Perception for Self-supervised Spatio-Temporal Representation Learning

In self-supervised spatio-temporal representation learning, the temporal resolution and long-short term characteristics are not yet fully explored, which limits representation capabilities of learned models. In this paper, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Yuan Yao, Chang Liu, Dezhao Luo, Yu Zhou, Qixiang Ye

Adversarial Framework for Unsupervised Learning of Motion Dynamics in Videos

Human behavior understanding in videos is a complex, still unsolved problem that requires accurately modeling motion at both the local (pixel-wise dense prediction) and global (aggregation of motion cues) levels. Current approaches based on…

Computer Vision and Pattern Recognition · Computer Science 2019-09-19 C. Spampinato, S. Palazzo, P. D'Oro, D. Giordano, M. Shah

Audiovisual representation learning typically relies on the correspondence between sight and sound. However, there are often multiple audio tracks that can correspond with a visual scene. Consider, for example, different conversations on…

Sound · Computer Science 2024-06-11 Nikhil Singh, Chih-Wei Wu, Iroro Orife, Mahdi Kalayeh

Temporal Contrastive Learning with Curriculum

We present ConCur, a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy in contrastive training. More specifically, ConCur starts the contrastive training with easy positive…

Computer Vision and Pattern Recognition · Computer Science 2022-09-05 Shuvendu Roy, Ali Etemad

Diversified Augmentation with Domain Adaptation for Debiased Video Temporal Grounding

Temporal sentence grounding in videos (TSGV) faces challenges due to public TSGV datasets containing significant temporal biases, which are attributed to the uneven temporal distributions of target moments. Existing methods generate…

Computer Vision and Pattern Recognition · Computer Science 2025-01-15 Junlong Ren, Gangjian Zhang, Haifeng Sun, Hao Wang

TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition

Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Ishan Rajendrakumar Dave, Mamshad Nayeem Rizve, Chen Chen, Mubarak Shah

Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound

In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access, making conventional deep learning-based models difficult to scale. As a result, it would be beneficial if useful representations could be…

Computer Vision and Pattern Recognition · Computer Science 2020-08-18 Jianbo Jiao, Yifan Cai, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

From Frames to Clips: Training-free Adaptive Key Clip Selection for Long-Form Video Understanding

Video Large Language Models (VLMs) have achieved strong performance on various vision-language tasks, yet their practical use is limited by the massive number of visual tokens produced from raw video frames, which quickly exhausts the…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Guangyu Sun, Archit Singhal, Burak Uzkent, Mubarak Shah, Chen Chen, Garin Kessler