Related papers: Controllable Augmentations for Video Representatio…

Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories

The standard way of training video models entails sampling at each iteration a single clip from a video and optimizing the clip prediction with respect to the video-level label. We argue that a single clip may not have enough temporal…

Computer Vision and Pattern Recognition · Computer Science 2021-04-06 Xitong Yang, Haoqi Fan, Lorenzo Torresani, Larry Davis, Heng Wang

Improved baselines for vision-language pre-training

Contrastive learning has emerged as an efficient framework to learn multimodal representations. CLIP, a seminal work in this area, achieved impressive results by training on paired image-text data using the contrastive loss. Recent work…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal

Learning Temporal Embeddings for Complex Video Analysis

In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are…

Computer Vision and Pattern Recognition · Computer Science 2015-05-05 Vignesh Ramanathan, Kevin Tang, Greg Mori, Li Fei-Fei

Multi-network Contrastive Learning Based on Global and Local Representations

The popularity of self-supervised learning has made it possible to train models without relying on labeled data, which saves expensive annotation costs. However, most existing self-supervised contrastive learning methods often overlook the…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Weiquan Li, Xianzhong Long, Yun Li

Identity-Consistent Video Generation under Large Facial-Angle Variations

Single-view reference-to-video methods often struggle to preserve identity consistency under large facial-angle variations. This limitation naturally motivates the incorporation of multi-view facial references. However, simply introducing…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Bin Hu, Zipeng Qi, Guoxi Huang, Zunnan Xu, Ruicheng Zhang, Chongjie Ye, Jun Zhou, Xiu Li, Jingdong Wang

Online Object Representations with Contrastive Learning

We propose a self-supervised approach for learning representations of objects from monocular videos and demonstrate it is particularly useful in situated settings such as robotics. The main contributions of this paper are: 1) a…

Computer Vision and Pattern Recognition · Computer Science 2019-06-12 Sören Pirk, Mohi Khansari, Yunfei Bai, Corey Lynch, Pierre Sermanet

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

This work presents a self-supervised learning framework named TeG to explore Temporal Granularity in learning video representations. In TeG, we sample a long clip from a video and a short clip that lies inside the long clip. We then extract…

Computer Vision and Pattern Recognition · Computer Science 2021-12-09 Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui

Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency

Recent works have advanced the performance of self-supervised representation learning by a large margin. The core among these methods is intra-image invariance learning. Two different transformations of one image instance are considered as…

Computer Vision and Pattern Recognition · Computer Science 2021-05-14 Haiping Wu, Xiaolong Wang

Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking

Deep neural networks require collecting and annotating large amounts of data to train successfully. In order to alleviate the annotation bottleneck, we propose a novel self-supervised representation learning approach for spatiotemporal…

Computer Vision and Pattern Recognition · Computer Science 2019-10-29 Alaaeldin El-Nouby, Shuangfei Zhai, Graham W. Taylor, Joshua M. Susskind

Counterfactual contrastive learning: robust representations via causal image synthesis

Contrastive pretraining is well-known to improve downstream task performance and model generalisation, especially in limited label settings. However, it is sensitive to the choice of augmentation pipeline. Positive pairs should preserve…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Melanie Roschewitz, Fabio De Sousa Ribeiro, Tian Xia, Galvin Khara, Ben Glocker

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Dazhao Du, Jian Liu, Jialong Qin, Tao Han, Bohai Gu, Fangqi Zhu, Yujia Zhang, Eric Liu, Xi Chen, Song Guo

Learning Robust Video Synchronization without Annotations

Aligning video sequences is a fundamental yet still unsolved component for a broad range of applications in computer graphics and vision. Most classical image processing methods cannot be directly applied to related video problems due to…

Computer Vision and Pattern Recognition · Computer Science 2017-09-19 Patrick Wieschollek, Ido Freeman, Hendrik P. A. Lensch

Support-set bottlenecks for video-text representation learning

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample,…

Computer Vision and Pattern Recognition · Computer Science 2021-01-15 Mandela Patrick, Po-Yao Huang, Yuki Asano, Florian Metze, Alexander Hauptmann, João Henriques, Andrea Vedaldi

Parametric Augmentation for Time Series Contrastive Learning

Modern techniques like contrastive learning have been effectively used in many areas, including computer vision, natural language processing, and graph-structured data. Creating positive examples that assist the model in learning robust and…

Machine Learning · Computer Science 2024-02-19 Xu Zheng, Tianchun Wang, Wei Cheng, Aitian Ma, Haifeng Chen, Mo Sha, Dongsheng Luo

Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition

We propose a self-supervised learning method to jointly reason about spatial and temporal context for video recognition. Recent self-supervised approaches have used spatial context [9, 34] as well as temporal coherency [32] but a…

Computer Vision and Pattern Recognition · Computer Science 2018-08-24 Unaiza Ahsan, Rishi Madhok, Irfan Essa

Language-based Action Concept Spaces Improve Video Self-Supervised Learning

Recent contrastive language image pre-training has led to learning highly transferable and robust image representations. However, adapting these models to video domains with minimal supervision remains an open problem. We explore a simple…

Computer Vision and Pattern Recognition · Computer Science 2023-10-27 Kanchana Ranasinghe, Michael Ryoo

Improving Contrastive Learning with Model Augmentation

The sequential recommendation aims at predicting the next items in user behaviors, which can be solved by characterizing item relationships in sequences. Due to the data sparsity and noise issues in sequences, a new self-supervised learning…

Machine Learning · Computer Science 2022-03-30 Zhiwei Liu, Yongjun Chen, Jia Li, Man Luo, Philip S. Yu, Caiming Xiong

Broaden Your Views for Self-Supervised Video Learning

Most successful self-supervised learning methods are trained to align the representations of two independent views from the data. State-of-the-art methods in video are inspired by image techniques, where these two views are similarly…

Computer Vision and Pattern Recognition · Computer Science 2021-10-20 Adrià Recasens, Pauline Luc, Jean-Baptiste Alayrac, Luyu Wang, Ross Hemsley, Florian Strub, Corentin Tallec, Mateusz Malinowski, Viorica Patraucean, Florent Altché, Michal Valko, Jean-Bastien Grill, Aäron van den Oord, Andrew Zisserman

Video Representation Learning by Recognizing Temporal Transformations

We introduce a novel self-supervised learning approach to learn representations of videos that are responsive to changes in the motion dynamics. Our representations can be learned from data without human annotation and provide a substantial…

Computer Vision and Pattern Recognition · Computer Science 2020-07-22 Simon Jenni, Givi Meishvili, Paolo Favaro

Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation

In this paper we propose an unsupervised feature extraction method to capture temporal information on monocular videos, where we detect and encode subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to…

Computer Vision and Pattern Recognition · Computer Science 2022-11-28 Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, Pascal Fua