Related papers: Controllable Augmentations for Video Representatio…

ViewMix: Augmentation for Robust Representation in Self-Supervised Learning

Joint Embedding Architecture-based self-supervised learning methods have attributed the composition of data augmentations as a crucial factor for their strong representation learning capabilities. While regional dropout strategies have…

Computer Vision and Pattern Recognition · Computer Science 2023-09-08 Arjon Das, Xin Zhong

Towards Principled Representation Learning from Videos for Reinforcement Learning

We study pre-training representations for decision-making using video data, which is abundantly available for tasks such as game agents and software testing. Even though significant empirical advances have been made on this problem, a…

Machine Learning · Computer Science 2024-03-21 Dipendra Misra, Akanksha Saran, Tengyang Xie, Alex Lamb, John Langford

Contrastive Transformation for Self-supervised Correspondence Learning

In this paper, we focus on the self-supervised learning of visual correspondence using unlabeled videos in the wild. Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence…

Computer Vision and Pattern Recognition · Computer Science 2020-12-10 Ning Wang, Wengang Zhou, Houqiang Li

Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence

The de facto approach in video object-centric learning maintains temporal consistency through learned dynamics modules that predict future object representations, called slots. We demonstrate that these predictors function as expensive…

Computer Vision and Pattern Recognition · Computer Science 2026-05-12 Zhiyuan Li, Rongzhen Zhao, Wenyan Yang, Wenshuai Zhao, Pekka Marttinen, Joni Pajarinen

Temporal Perceiving Video-Language Pre-training

Video-Language Pre-training models have recently significantly improved various multi-modal downstream tasks. Previous dominant works mainly adopt contrastive learning to achieve global feature alignment across modalities. However, the…

Computer Vision and Pattern Recognition · Computer Science 2023-01-19 Fan Ma, Xiaojie Jin, Heng Wang, Jingjia Huang, Linchao Zhu, Jiashi Feng, Yi Yang

Temporal Graph Representation Learning with Adaptive Augmentation Contrastive

Temporal graph representation learning aims to generate low-dimensional dynamic node embeddings to capture temporal information as well as structural and property information. Current representation learning methods for temporal networks…

Machine Learning · Computer Science 2023-11-08 Hongjiang Chen, Pengfei Jiao, Huijun Tang, Huaming Wu

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich…

Computer Vision and Pattern Recognition · Computer Science 2021-10-22 Martine Toering, Ioannis Gatopoulos, Maarten Stol, Vincent Tao Hu

Hierarchical Contrastive Learning with Multiple Augmentation for Sequential Recommendation

Sequential recommendation addresses the issue of preference drift by predicting the next item based on the user's previous behaviors. Recently, a promising approach using contrastive learning has emerged, demonstrating its effectiveness in…

Information Retrieval · Computer Science 2023-08-08 Dongjun Lee, Donggeun Ko, Jaekwang Kim

Hierarchical Contrastive Motion Learning for Video Action Recognition

One central question for video action recognition is how to model motion. In this paper, we present hierarchical contrastive motion learning, a new self-supervised learning framework to extract effective motion representations from raw…

Computer Vision and Pattern Recognition · Computer Science 2022-01-19 Xitong Yang, Xiaodong Yang, Sifei Liu, Deqing Sun, Larry Davis, Jan Kautz

Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows

The abundance and ease of utilizing sound, along with the fact that auditory clues reveal so much about what happens in the scene, make the audio-visual space a perfectly intuitive choice for self-supervised representation learning.…

Computer Vision and Pattern Recognition · Computer Science 2021-06-17 Mahdi M. Kalayeh, Nagendra Kamath, Lingyi Liu, Ashok Chandrashekar

CLAR: Contrastive Learning of Auditory Representations

Learning rich visual representations using contrastive self-supervised learning has been extremely successful. However, it is still a major question whether we could use a similar approach to learn superior auditory representations. In this…

Sound · Computer Science 2020-10-20 Haider Al-Tahan, Yalda Mohsenzadeh

End-To-End Trainable Video Super-Resolution Based on a New Mechanism for Implicit Motion Estimation and Compensation

Video super-resolution aims at generating a high-resolution video from its low-resolution counterpart. With the rapid rise of deep learning, many recently proposed video super-resolution methods use convolutional neural networks in…

Computer Vision and Pattern Recognition · Computer Science 2020-01-07 Xiaohong Liu, Lingshi Kong, Yang Zhou, Jiying Zhao, Jun Chen

Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs)…

Robotics · Computer Science 2026-02-12 Songen Gu, Yunuo Cai, Tianyu Wang, Simo Wu, Yanwei Fu

Self-supervised Contrastive Learning for Implicit Collaborative Filtering

Contrastive learning-based recommendation algorithms have significantly advanced the field of self-supervised recommendation, particularly with BPR as a representative ranking prediction task that dominates implicit collaborative filtering.…

Information Retrieval · Computer Science 2024-03-13 Shipeng Song, Bin Liu, Fei Teng, Tianrui Li

Learning Customized Visual Models with Retrieval-Augmented Knowledge

Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept…

Computer Vision and Pattern Recognition · Computer Science 2023-01-18 Haotian Liu, Kilho Son, Jianwei Yang, Ce Liu, Jianfeng Gao, Yong Jae Lee, Chunyuan Li

Automatic Data Augmentation Selection and Parametrization in Contrastive Self-Supervised Speech Representation Learning

Contrastive learning enables learning useful audio and speech representations without ground-truth labels by maximizing the similarity between latent representations of similar signal segments. In this framework various data augmentation…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-11 Salah Zaiem, Titouan Parcollet, Slim Essid

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

Video understanding relies on perceiving the global content and modeling its internal connections (e.g., causality, movement, and spatio-temporal correspondence). To learn these interactions, we apply a mask-then-predict pre-training task…

Computer Vision and Pattern Recognition · Computer Science 2021-06-22 Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

An Augmentation Overlap Theory of Contrastive Learning

Recently, self-supervised contrastive learning has achieved great success on various tasks. However, its underlying working mechanism is yet unclear. In this paper, we first provide the tightest bounds based on the widely adopted assumption…

Machine Learning · Computer Science 2025-11-06 Qi Zhang, Yifei Wang, Yisen Wang

Contrastive Domain Adaptation

Recently, contrastive self-supervised learning has become a key component for learning visual representations across many computer vision tasks and benchmarks. However, contrastive learning in the context of domain adaptation remains…

Computer Vision and Pattern Recognition · Computer Science 2021-06-25 Mamatha Thota, Georgios Leontidis

Generative Spatiotemporal Data Augmentation

We explore spatiotemporal data augmentation using video foundation models to diversify both camera viewpoints and scene dynamics. Unlike existing approaches based on simple geometric transforms or appearance perturbations, our method…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Jinfan Zhou, Lixin Luo, Sungmin Eum, Heesung Kwon, Jeong Joon Park