Related papers: Controllable Augmentations for Video Representatio…

Object-centric Binding in Contrastive Language-Image Pretraining

Recent advances in vision language models (VLM) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-21 · Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

Provable Guarantees for Self-Supervised Deep Learning with Spectral Contrastive Loss

Recent works in self-supervised learning have advanced the state-of-the-art by relying on the contrastive learning paradigm, which learns representations by pushing positive pairs, or similar examples from the same class, closer together…

Machine Learning · Computer Science · 2022-06-27 · Jeff Z. HaoChen, Colin Wei, Adrien Gaidon, Tengyu Ma

Learning Street View Representations with Spatiotemporal Contrast

Street view imagery is extensively utilized in representation learning for urban visual environments, supporting various sustainable development tasks such as environmental perception and socio-economic assessment. However, it is…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-24 · Yong Li, Yingjing Huang, Gengchen Mai, Fan Zhang

Video Acceleration Magnification

The ability to amplify or reduce subtle image changes over time is useful in contexts such as video editing, medical video analysis, product quality control and sports. In these contexts there is often large motion present which severely…

Computer Vision and Pattern Recognition · Computer Science · 2017-04-25 · Yichao Zhang, Silvia L. Pintea, Jan C. van Gemert

Contrastive Language-Action Pre-training for Temporal Localization

Long-form video understanding requires designing approaches that are able to temporally localize activities or language. End-to-end training for such tasks is limited by the compute device memory constraints and lack of temporal annotations…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-27 · Mengmeng Xu, Erhan Gundogdu, Maksim Lapin, Bernard Ghanem, Michael Donoser, Loris Bazzani

Frequency Selective Augmentation for Video Representation Learning

Recent self-supervised video representation learning methods focus on maximizing the similarity between multiple augmented views from the same video and largely rely on the quality of generated views. However, most existing methods lack a…

Computer Vision and Pattern Recognition · Computer Science · 2022-12-07 · Jinhyung Kim, Taeoh Kim, Minho Shim, Dongyoon Han, Dongyoon Wee, Junmo Kim

Find, Fix, Reason: Context Repair for Video Reasoning

Reinforcement learning has advanced video reasoning in large multi-modal models, yet dominant pipelines either rely on on-policy self-exploration, which plateaus at the model's knowledge boundary, or hybrid replay that mixes policies and…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-04 · Haojian Huang, Chuanyu Qin, Yinchuan Li, Yingcong Chen

Seeing Fast and Slow: Learning the Flow of Time in Videos

How can we tell whether a video has been sped up or slowed down? How can we generate videos at different speeds? Although videos have been central to modern computer vision research, little attention has been paid to perceiving and…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-24 · Yen-Siang Wu, Rundong Luo, Jingsen Zhu, Tao Tu, Ali Farhadi, Matthew Wallingford, Yu-Chiang Frank Wang, Steve Marschner, Wei-Chiu Ma

Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations

Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection. However, current methods are still primarily applied to curated datasets like ImageNet. In this…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-15 · Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, Luc Van Gool

On Compositions of Transformations in Contrastive Self-Supervised Learning

In the image domain, excellent representations can be learned by inducing invariance to content-preserving transformations via noise contrastive learning. In this paper, we generalize contrastive learning to a wider set of transformations,…

Computer Vision and Pattern Recognition · Computer Science · 2021-10-28 · Mandela Patrick, Yuki M. Asano, Polina Kuznetsova, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi

Aligning Videos in Space and Time

In this paper, we focus on the task of extracting visual correspondences across videos. Given a query video clip from an action class, we aim to align it with training videos in space and time. Obtaining training data for such a…

Computer Vision and Pattern Recognition · Computer Science · 2020-07-10 · Senthil Purushwalkam, Tian Ye, Saurabh Gupta, Abhinav Gupta

Time to augment self-supervised visual representation learning

Biological vision systems are unparalleled in their ability to learn visual representations without supervision. In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an…

Machine Learning · Computer Science · 2022-12-22 · Arthur Aubret, Markus Ernst, Céline Teulière, Jochen Triesch

A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning

We present a large-scale study on unsupervised spatiotemporal representation learning from videos. With a unified perspective on four recent image-based frameworks, we study a simple objective that can easily generalize all these methods to…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-30 · Christoph Feichtenhofer, Haoqi Fan, Bo Xiong, Ross Girshick, Kaiming He

Cycle-Contrast for Self-Supervised Video Representation Learning

We present Cycle-Contrastive Learning (CCL), a novel self-supervised method for learning video representation. Following a nature that there is a belong and inclusion relation of video and its frames, CCL is designed to find correspondences…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-29 · Quan Kong, Wenpeng Wei, Ziwei Deng, Tomoaki Yoshinaga, Tomokazu Murakami

Self-supervised video pretraining yields robust and more human-aligned visual representations

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-13 · Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff

Temporal Context Aggregation for Video Retrieval with Contrastive Learning

The current research focus on Content-Based Video Retrieval requires higher-level video representation describing the long-range semantic dependencies of relevant incidents, events, etc. However, existing methods commonly process the frames…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-01 · Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

Adaptive Data Augmentation for Contrastive Learning

In computer vision, contrastive learning is the most advanced unsupervised learning framework. Yet most previous methods simply apply fixed composition of data augmentations to improve data efficiency, which ignores the changes in their…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-20 · Yuhan Zhang, He Zhu, Shan Yu

Class-Incremental Learning with CLIP: Adaptive Representation Adjustment and Parameter Fusion

Class-incremental learning is a challenging problem, where the goal is to train a model that can classify data from an increasing number of classes over time. With the advancement of vision-language pre-trained models such as CLIP, they…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-22 · Linlan Huang, Xusheng Cao, Haori Lu, Xialei Liu

Self-Supervised Representation Learning for Visual Anomaly Detection

Self-supervised learning allows for better utilization of unlabelled data. The feature representation obtained by self-supervision can be used in downstream tasks such as classification, object detection, segmentation, and anomaly…

Computer Vision and Pattern Recognition · Computer Science · 2020-06-18 · Rabia Ali, Muhammad Umar Karim Khan, Chong Min Kyung

Alignment-guided Temporal Attention for Video Action Recognition

Temporal modeling is crucial for various video learning tasks. Most recent approaches employ either factorized (2D+1D) or joint (3D) spatial-temporal operations to extract temporal contexts from the input frames. While the former is more…

Computer Vision and Pattern Recognition · Computer Science · 2023-01-03 · Yizhou Zhao, Zhenyang Li, Xun Guo, Yan Lu