Related papers: Masked Motion Encoding for Self-Supervised Video R…

Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders

Masked autoencoders (MAEs) have recently emerged as state-of-the-art self-supervised spatiotemporal representation learners. Inheriting from the image counterparts, however, existing video MAEs still focus largely on static appearance learning whilst…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, Zehuan Yuan

TrackMAE: Video Representation Learning via Track Mask and Predict

Masked video modeling (MVM) has emerged as a simple and scalable self-supervised pretraining paradigm, but only encodes motion information implicitly, limiting the encoding of temporal dynamics in the learned representations. As a result,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-31 Renaud Vandeghen, Fida Mohammad Thoker, Marc Van Droogenbroeck, Bernard Ghanem

Motion-Guided Masking for Spatiotemporal Representation Learning

Several recent works have directly extended the image masked autoencoder (MAE) with random masking into the video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video…

Computer Vision and Pattern Recognition · Computer Science 2023-08-25 David Fan, Jue Wang, Shuai Liao, Yi Zhu, Vimal Bhat, Hector Santos-Villalobos, Rohith MV, Xinyu Li

MGMAE: Motion Guided Masking for Video Masked Autoencoding

Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE. In this paper, we aim to further improve…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, Limin Wang

It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training

Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline. Such methods have demonstrated outstanding effectiveness on downstream video tasks and superior data efficiency on small datasets. However,…

Computer Vision and Pattern Recognition · Computer Science 2022-10-12 Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang

LV-MAE: Learning Long Video Representations through Masked-Embedding Autoencoders

In this work, we introduce long-video masked-embedding autoencoders (LV-MAE), a self-supervised learning framework for long video representation. Our approach treats short- and long-span dependencies as two separate tasks. Such decoupling…

Computer Vision and Pattern Recognition · Computer Science 2025-10-08 Ilan Naiman, Emanuel Ben-Baruch, Oron Anschel, Alon Shoshan, Igor Kviatkovsky, Manoj Aggarwal, Gerard Medioni

SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning

Masked video modeling, such as VideoMAE, is an effective paradigm for video self-supervised learning (SSL). However, such methods are primarily based on reconstructing pixel-level details on natural videos which have substantial temporal…

Computer Vision and Pattern Recognition · Computer Science 2025-04-02 Fida Mohammad Thoker, Letian Jiang, Chen Zhao, Bernard Ghanem

The TIME Machine: On The Power of Motion for Efficient Perception

Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara

Concatenated Masked Autoencoders as Spatial-Temporal Learner

Learning representations from videos requires understanding continuous motion and visual correspondences between frames. In this paper, we introduce the Concatenated Masked Autoencoders (CatMAE) as a spatial-temporal learner for…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Zhouqiang Jiang, Bowen Wang, Tong Xiang, Zhaofeng Niu, Hong Tang, Guangshun Li, Liangzhi Li

Social-MAE: Social Masked Autoencoder for Multi-person Motion Representation Learning

For a complete comprehension of multi-person scenes, it is essential to go beyond basic tasks like detection and tracking. Higher-level tasks, such as understanding the interactions and social activities among individuals, are also crucial.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Mahsa Ehsanpour, Ian Reid, Hamid Rezatofighi

MV2MAE: Multi-View Video Masked Autoencoders

Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition, tracking, etc. In this paper, we present a method for self-supervised learning from…

Computer Vision and Pattern Recognition · Computer Science 2024-01-30 Ketul Shah, Robert Crandall, Jie Xu, Peng Zhou, Marian George, Mayank Bansal, Rama Chellappa

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective, which may lead the model to prioritize general spatial-temporal patterns but often overlook…

Computer Vision and Pattern Recognition · Computer Science 2025-02-13 Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan

Masking Modalities for Cross-modal Video Retrieval

Pre-training on large scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy…

Computer Vision and Pattern Recognition · Computer Science 2021-11-04 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

SurgMAE: Masked Autoencoders for Long Surgical Video Analysis

There has been a growing interest in using deep learning models for processing long surgical videos, in order to automatically detect clinical/operational activities and extract metrics that can enable workflow efficiency tools and…

Computer Vision and Pattern Recognition · Computer Science 2023-05-22 Muhammad Abdullah Jamal, Omid Mohareri

Unsupervised Learning of Video Representations using LSTMs

We use multilayer Long Short Term Memory (LSTM) networks to learn representations of video sequences. Our model uses an encoder LSTM to map an input sequence into a fixed length representation. This representation is decoded using single or…

Machine Learning · Computer Science 2016-01-05 Nitish Srivastava, Elman Mansimov, Ruslan Salakhutdinov

Improvements to Self-Supervised Representation Learning for Masked Image Modeling

This paper explores improvements to the masked image modeling (MIM) paradigm. The MIM paradigm enables the model to learn the main object features of the image by masking the input image and predicting the masked part from the unmasked part.…

Computer Vision and Pattern Recognition · Computer Science 2022-05-24 Jiawei Mao, Xuesong Yin, Yuanqi Chang, Honggu Zhou

Learning with Unmasked Tokens Drives Stronger Vision Learners

Masked image modeling (MIM) has become a leading self-supervised learning strategy. MIMs such as Masked Autoencoder (MAE) learn strong representations by randomly masking input tokens for the encoder to process, with the decoder…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Taekyung Kim, Sanghyuk Chun, Byeongho Heo, Dongyoon Han

Self-supervised Spatio-temporal Representation Learning for Videos by Predicting Motion and Appearance Statistics

We address the problem of video representation learning without human-annotated labels. While previous efforts address the problem by designing novel self-supervised tasks using video data, the learned features are merely on a…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Yunhui Liu, Wei Liu

Multi-Domain Motion Embedding: Expressive Real-Time Mimicry for Legged Robots

Effective motion representation is crucial for enabling robots to imitate expressive behaviors in real time, yet existing motion controllers often ignore inherent patterns in motion. Previous efforts in representation learning do not…

Robotics · Computer Science 2025-12-09 Matthias Heyrman, Chenhao Li, Victor Klemm, Dongho Kang, Stelian Coros, Marco Hutter

Unsupervised Learning of Long-Term Motion Dynamics for Videos

We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos. Given a pair of images from a video clip, our framework learns to predict the long-term 3D motions. To reduce the…

Computer Vision and Pattern Recognition · Computer Science 2017-04-13 Zelun Luo, Boya Peng, De-An Huang, Alexandre Alahi, Li Fei-Fei