Related papers: Controllable Augmentations for Video Representatio…

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu

Multi-Scale Contrastive Learning for Video Temporal Grounding

Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. To encode video moments of varying lengths, recent methods employ a multi-level…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Thong Thanh Nguyen, Yi Bin, Xiaobao Wu, Zhiyuan Hu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

Cross-Architecture Self-supervised Video Representation Learning

In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning. CACL consists of a 3D CNN and a video transformer which are used in parallel to generate diverse…

Computer Vision and Pattern Recognition · Computer Science 2022-05-27 Sheng Guo, Zihua Xiong, Yujie Zhong, Limin Wang, Xiaobo Guo, Bing Han, Weilin Huang

Dual Contrastive Learning for Spatio-temporal Representation

Contrastive learning has shown promising potential in self-supervised spatio-temporal representation learning. Most works naively sample different clips to construct positive and negative pairs. However, we observe that this formulation…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Shuangrui Ding, Rui Qian, Hongkai Xiong

Robust image representations with counterfactual contrastive learning

Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Mélanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker

Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency

Natural videos provide rich visual contents for self-supervised learning. Yet most existing approaches for learning spatio-temporal representations rely on manually trimmed videos, leading to limited diversity in visual patterns and limited…

Computer Vision and Pattern Recognition · Computer Science 2022-04-08 Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yi Xu, Xiang Wang, Mingqian Tang, Changxin Gao, Rong Jin, Nong Sang

Spatial-then-Temporal Self-Supervised Learning for Video Correspondence

In low-level video analyses, effective representations are important to derive the correspondences between video frames. These representations have been learned in a self-supervised fashion from unlabeled images or videos, using carefully…

Computer Vision and Pattern Recognition · Computer Science 2023-06-23 Rui Li, Dong Liu

TempCLR: Temporal Alignment Representation with Contrastive Learning

Video representation learning has been successful in video-text pre-training for zero-shot transfer, where each sentence is trained to be close to the paired video clips in a common feature space. For long videos, given a paragraph of…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Yuncong Yang, Jiawei Ma, Shiyuan Huang, Long Chen, Xudong Lin, Guangxing Han, Shih-Fu Chang

Contrastive Neural Processes for Self-Supervised Learning

Recent contrastive methods show significant improvement in self-supervised learning in several domains. In particular, contrastive methods are most effective where data augmentation can be easily constructed e.g. in computer vision.…

Machine Learning · Computer Science 2021-12-09 Konstantinos Kallidromitis, Denis Gudovskiy, Kazuki Kozuka, Iku Ohama, Luca Rigazio

Learning Temporally Invariant and Localizable Features via Data Augmentation for Video Recognition

Deep-Learning-based video recognition has shown promising improvements along with the development of large-scale datasets and spatiotemporal network architectures. In image recognition, learning spatially invariant features is a key factor…

Computer Vision and Pattern Recognition · Computer Science 2020-08-14 Taeoh Kim, Hyeongmin Lee, MyeongAh Cho, Ho Seong Lee, Dong Heon Cho, Sangyoun Lee

Temporally Consistent Object-Centric Learning by Contrasting Slots

Unsupervised object-centric learning from videos is a promising approach to extract structured representations from large, unlabeled collections of videos. To support downstream tasks like autonomous control, these representations must be…

Computer Vision and Pattern Recognition · Computer Science 2025-03-19 Anna Manasyan, Maximilian Seitzer, Filip Radovic, Georg Martius, Andrii Zadaianchuk

Learning Cross-modal Contrastive Features for Video Domain Adaptation

Learning transferable and domain adaptive feature representations from videos is important for video-relevant tasks such as action recognition. Existing video domain adaptation methods mainly rely on adversarial feature alignment, which has…

Computer Vision and Pattern Recognition · Computer Science 2021-08-30 Donghyun Kim, Yi-Hsuan Tsai, Bingbing Zhuang, Xiang Yu, Stan Sclaroff, Kate Saenko, Manmohan Chandraker

Pretext-Contrastive Learning: Toward Good Practices in Self-supervised Video Representation Leaning

Recently, pretext-task-based methods have been proposed one after another in self-supervised video feature learning. Meanwhile, contrastive learning methods also yield good performance. Usually, new methods can beat previous ones as claimed that…

Computer Vision and Pattern Recognition · Computer Science 2021-04-06 Li Tao, Xueting Wang, Toshihiko Yamasaki

Temporal Contrastive Graph Learning for Video Action Recognition and Retrieval

Attempting to fully discover the temporal diversity and chronological characteristics for self-supervised video representation learning, this work takes advantage of the temporal dependencies within videos and further proposes a novel…

Computer Vision and Pattern Recognition · Computer Science 2021-03-18 Yang Liu, Keze Wang, Haoyuan Lan, Liang Lin

Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization

We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn…

Computer Vision and Pattern Recognition · Computer Science 2023-09-29 Fida Mohammad Thoker, Hazel Doughty, Cees Snoek

CoViews: Adaptive Augmentation Using Cooperative Views for Enhanced Contrastive Learning

Data augmentation plays a critical role in generating high-quality positive and negative pairs necessary for effective contrastive learning. However, common practices involve using a single augmentation policy repeatedly to generate…

Computer Vision and Pattern Recognition · Computer Science 2024-05-14 Nazim Bendib

Learning Visual Composition through Improved Semantic Guidance

Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building…

Computer Vision and Pattern Recognition · Computer Science 2025-04-07 Austin Stone, Hagen Soltau, Robert Geirhos, Xi Yi, Ye Xia, Bingyi Cao, Kaifeng Chen, Abhijit Ogale, Jonathon Shlens

Self-supervised Representation Learning Framework for Remote Physiological Measurement Using Spatiotemporal Augmentation Loss

Recent advances in supervised deep learning methods are enabling remote measurements of photoplethysmography-based physiological signals using facial videos. The performance of these supervised methods, however, is dependent on the…

Computer Vision and Pattern Recognition · Computer Science 2021-12-15 Hao Wang, Euijoon Ahn, Jinman Kim

Video Understanding: Through A Temporal Lens

This thesis explores the central question of how to leverage temporal relations among video elements to advance video understanding. Addressing the limitations of existing methods, the work presents a five-fold contribution: (1) an…

Computer Vision and Pattern Recognition · Computer Science 2026-04-06 Thong Thanh Nguyen

Self-supervised Temporal Discriminative Learning for Video Representation Learning

Temporal cues in videos provide important information for recognizing actions accurately. However, temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training. This paper…

Computer Vision and Pattern Recognition · Computer Science 2020-08-06 Jinpeng Wang, Yiqi Lin, Andy J. Ma, Pong C. Yuen