Related papers: Controllable Augmentations for Video Representatio…

Composable Augmentation Encoding for Video Representation Learning

We focus on contrastive methods for self-supervised video representation learning. A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data…

Computer Vision and Pattern Recognition · Computer Science 2021-08-23 Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid

The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning

Contrastive learning of auditory and visual perception has been extremely successful when investigated individually. However, there are still major questions on how we could integrate principles learned from both domains to attain effective…

Computer Vision and Pattern Recognition · Computer Science 2021-10-15 Haider Al-Tahan, Yalda Mohsenzadeh

Spatiotemporal Contrastive Video Representation Learning

We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Our representations are learned using a contrastive loss, where two augmented clips…

Computer Vision and Pattern Recognition · Computer Science 2021-04-07 Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui

Video Contrastive Learning with Global Context

Contrastive learning has revolutionized self-supervised image representation learning field, and recently been adapted to video domain. One of the greatest advantages of contrastive learning is that it allows us to flexibly define powerful…

Computer Vision and Pattern Recognition · Computer Science 2021-08-06 Haofei Kuang, Yi Zhu, Zhi Zhang, Xinyu Li, Joseph Tighe, Sören Schwertfeger, Cyrill Stachniss, Mu Li

Audio-Visual Contrastive Learning with Temporal Self-Supervision

We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision. In contrast to images that capture the static scene appearance, videos also…

Computer Vision and Pattern Recognition · Computer Science 2023-02-16 Simon Jenni, Alexander Black, John Collomosse

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision

Self-supervised approaches for video have shown impressive results in video understanding tasks. However, unlike early works that leverage temporal self-supervision, current state-of-the-art methods primarily rely on tasks from the image…

Computer Vision and Pattern Recognition · Computer Science 2023-12-21 Ishan Rajendrakumar Dave, Simon Jenni, Mubarak Shah

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

We propose a self-supervised method to learn feature representations from videos. A standard approach in traditional self-supervised methods uses positive-negative data pairs to train with contrastive learning strategy. In such a case,…

Computer Vision and Pattern Recognition · Computer Science 2020-08-13 Li Tao, Xueting Wang, Toshihiko Yamasaki

TCLR: Temporal Contrastive Learning for Video Representation

Contrastive learning has nearly closed the gap between supervised and self-supervised learning of image representations, and has also been explored for videos. However, prior work on contrastive learning for video data has not explored the…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Ishan Dave, Rohit Gupta, Mamshad Nayeem Rizve, Mubarak Shah

Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning

We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it…

Computer Vision and Pattern Recognition · Computer Science 2021-09-02 Zehua Zhang, David Crandall

Self-Supervised Visual Learning by Variable Playback Speeds Prediction of a Video

We propose a self-supervised visual learning method by predicting the variable playback speeds of a video. Without semantic labels, we learn the spatio-temporal visual representation of the video by leveraging the variations in the visual…

Computer Vision and Pattern Recognition · Computer Science 2021-06-02 Hyeon Cho, Taehoon Kim, Hyung Jin Chang, Wonjun Hwang

Time-Equivariant Contrastive Video Representation Learning

We introduce a novel self-supervised contrastive learning method to learn representations from unlabelled videos. Existing approaches ignore the specifics of input distortions, e.g., by learning invariance to temporal transformations.…

Computer Vision and Pattern Recognition · Computer Science 2021-12-08 Simon Jenni, Hailin Jin

Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment

Robust frame-wise embeddings are essential to perform video analysis and understanding tasks. We present a self-supervised method for representation learning based on aligning temporal video sequences. Our framework uses a transformer-based…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Keyne Oei, Amr Gomaa, Anna Maria Feit, João Belo

Supervised Contrastive Frame Aggregation for Video Representation Learning

We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video to image aggregation strategy that spatially arranges multiple frames from each video…

Computer Vision and Pattern Recognition · Computer Science 2025-12-16 Shaif Chowdhury, Mushfika Rahman, Greg Hamerly

Learning by Aligning Videos in Time

We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task, while exploiting both frame-level and video-level information. We leverage a novel combination of temporal alignment…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Sanjay Haresh, Sateesh Kumar, Huseyin Coskun, Shahram Najam Syed, Andrey Konin, Muhammad Zeeshan Zia, Quoc-Huy Tran

Contrastive Learning of Global-Local Video Representations

Contrastive learning has delivered impressive results for various tasks in the self-supervised regime. However, existing approaches optimize for learning representations specific to downstream scenarios, i.e., global…

Machine Learning · Computer Science 2021-10-29 Shuang Ma, Zhaoyang Zeng, Daniel McDuff, Yale Song

Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization

The crux of self-supervised video representation learning is to build general features from unlabeled videos. However, most recent works have mainly focused on high-level semantics and neglected lower-level representations and their…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Rui Qian, Yuxi Li, Huabin Liu, John See, Shuangrui Ding, Xian Liu, Dian Li, Weiyao Lin

Representation Learning via Global Temporal Alignment and Cycle-Consistency

We introduce a weakly supervised method for representation learning based on aligning temporal sequences (e.g., videos) of the same process (e.g., human action). The main idea is to use the global temporal ordering of latent correspondences…

Computer Vision and Pattern Recognition · Computer Science 2021-05-12 Isma Hadji, Konstantinos G. Derpanis, Allan D. Jepson

Time Is MattEr: Temporal Self-supervision for Video Transformers

Understanding temporal dynamics of video is an essential aspect of learning better video representations. Recently, transformer-based architectural designs have been extensively explored for video tasks due to their capability to capture…

Computer Vision and Pattern Recognition · Computer Science 2022-07-20 Sukmin Yun, Jaehyung Kim, Dongyoon Han, Hwanjun Song, Jung-Woo Ha, Jinwoo Shin

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation

To equip artificial intelligence with a comprehensive understanding towards a temporal world, video and 4D panoptic scene graph generation abstracts visual data into nodes to represent entities and edges to capture temporal relations.…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Thong Thanh Nguyen, Xiaobao Wu, Yi Bin, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu

Spatiotemporal Contrastive Learning of Facial Expressions in Videos

We propose a self-supervised contrastive learning approach for facial expression recognition (FER) in videos. We propose a novel temporal sampling-based augmentation scheme to be utilized in addition to standard spatial augmentations used…

Computer Vision and Pattern Recognition · Computer Science 2021-08-09 Shuvendu Roy, Ali Etemad