Related papers: Scaling 4D Representations

Scaling and Benchmarking Self-Supervised Visual Representation Learning

Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning - the ability to scale to large amount of data because…

Computer Vision and Pattern Recognition · Computer Science 2019-06-07 Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra

Scaling may be all you need for achieving human-level object recognition capacity with human-like visual experience

This paper asks whether current self-supervised learning methods, if sufficiently scaled up, would be able to reach human-level visual object recognition capabilities with the same type and amount of visual experience humans learn from.…

Computer Vision and Pattern Recognition · Computer Science 2023-08-11 A. Emin Orhan

Masked Autoencoders Are Scalable Vision Learners

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core…

Computer Vision and Pattern Recognition · Computer Science 2021-12-21 Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick

Self-Supervised Learning via multi-Transformation Classification for Action Recognition

Self-supervised tasks have been utilized to build useful representations that can be used in downstream tasks when the annotation is unavailable. In this paper, we introduce a self-supervised video representation learning method based on…

Computer Vision and Pattern Recognition · Computer Science 2021-02-23 Duc Quang Vu, Ngan T. H. Le, Jia-Ching Wang

Video 3D Sampling for Self-supervised Representation Learning

Most of the existing video self-supervised methods mainly leverage temporal signals of videos, ignoring that the semantics of moving objects and environmental information are all critical for video-related tasks. In this paper, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2021-07-09 Wei Li, Dezhao Luo, Bo Fang, Yu Zhou, Weiping Wang

Uni4D: A Unified Self-Supervised Learning Framework for Point Cloud Videos

Self-supervised representation learning for point cloud videos remains a challenging problem with two key limitations: (1) existing methods rely on explicit knowledge to learn motion, resulting in suboptimal representations; (2) prior…

Computer Vision and Pattern Recognition · Computer Science 2025-05-21 Zhi Zuo, Chenyi Zhuang, Pan Gao, Jie Qin, Hao Feng, Nicu Sebe

Masked Autoencoding Does Not Help Natural Language Supervision at Scale

Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these…

Computer Vision and Pattern Recognition · Computer Science 2023-05-16 Floris Weers, Vaishaal Shankar, Angelos Katharopoulos, Yinfei Yang, Tom Gunter

A Large-Scale Analysis on Self-Supervised Video Representation Learning

Self-supervised learning is an effective way for label-free model pre-training, especially in the video domain where labeling is expensive. Existing self-supervised works in the video domain use varying experimental setups to demonstrate…

Computer Vision and Pattern Recognition · Computer Science 2023-11-22 Akash Kumar, Ashlesha Kumar, Vibhav Vineet, Yogesh Singh Rawat

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders

Current video-based Masked Autoencoders (MAEs) primarily focus on learning effective spatiotemporal representations from a visual perspective, which may lead the model to prioritize general spatial-temporal patterns but often overlook…

Computer Vision and Pattern Recognition · Computer Science 2025-02-13 Shihab Aaqil Ahamed, Malitha Gunawardhana, Liel David, Michael Sidorov, Daniel Harari, Muhammad Haris Khan

Real-World Robot Learning with Masked Visual Pre-training

In this work, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks. Like prior work, our visual representations are pre-trained via a masked autoencoder (MAE), frozen, and…

Robotics · Computer Science 2022-10-07 Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, Trevor Darrell

MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning

Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D…

Computer Vision and Pattern Recognition · Computer Science 2024-03-14 Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang

Self-Supervised Learning for Videos: A Survey

The remarkable success of deep learning in various domains relies on the availability of large-scale annotated datasets. However, obtaining annotations is expensive and requires great effort, which is especially challenging for videos.…

Computer Vision and Pattern Recognition · Computer Science 2023-07-20 Madeline C. Schiappa, Yogesh S. Rawat, Mubarak Shah

Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning

Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models…

Computer Vision and Pattern Recognition · Computer Science 2023-09-25 Colorado J. Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, Trevor Darrell

Masked Modeling for Self-supervised Representation Learning on Vision and Beyond

As the deep learning revolution marches on, self-supervised learning has garnered increasing attention in recent years thanks to its remarkable representation learning ability and the low dependence on labeled data. Among these varied…

Computer Vision and Pattern Recognition · Computer Science 2024-01-10 Siyuan Li, Luyuan Zhang, Zedong Wang, Di Wu, Lirong Wu, Zicheng Liu, Jun Xia, Cheng Tan, Yang Liu, Baigui Sun, Stan Z. Li

Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics

This paper proposes a novel pretext task to address the self-supervised video representation learning problem. Specifically, given an unlabeled video clip, we compute a series of spatio-temporal statistical summaries, such as the spatial…

Computer Vision and Pattern Recognition · Computer Science 2021-02-01 Jiangliu Wang, Jianbo Jiao, Linchao Bao, Shengfeng He, Wei Liu, Yun-hui Liu

Simulated Cortical Magnification Supports Self-Supervised Object Learning

Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. However, these models ignore the foveated nature of human vision with high/low…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Zhengyang Yu, Arthur Aubret, Chen Yu, Jochen Triesch

Improving Visual Representation Learning through Perceptual Understanding

We present an extension to masked autoencoders (MAE) which improves on the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by: (i) the introduction of a perceptual…

Computer Vision and Pattern Recognition · Computer Science 2023-03-29 Samyakh Tukra, Frederick Hoffman, Ken Chatfield

Self-supervised Representation Learning for Ultrasound Video

Recent advances in deep learning have achieved promising performance for medical image analysis, while in most cases ground-truth annotations from human experts are necessary to train the deep model. In practice, such annotations are…

Computer Vision and Pattern Recognition · Computer Science 2020-03-03 Jianbo Jiao, Richard Droste, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

Masked Autoencoders are Scalable Learners of Cellular Morphology

Inferring biological relationships from cellular phenotypes in high-content microscopy screens provides significant opportunity and challenge in biological research. Prior results have shown that deep vision models can capture biological…

Computer Vision and Pattern Recognition · Computer Science 2023-11-29 Oren Kraus, Kian Kenyon-Dean, Saber Saberian, Maryam Fallah, Peter McLean, Jess Leung, Vasudev Sharma, Ayla Khan, Jia Balakrishnan, Safiye Celik, Maciej Sypetkowski, Chi Vicky Cheng, Kristen Morse, Maureen Makes, Ben Mabey, Berton Earnshaw

Masked Scene Modeling: Narrowing the Gap Between Supervised and Self-Supervised Learning in 3D Scene Understanding

Self-supervised learning has transformed 2D computer vision by enabling models trained on large, unannotated datasets to provide versatile off-the-shelf features that perform similarly to models trained with labels. However, in 3D scene…

Computer Vision and Pattern Recognition · Computer Science 2025-04-10 Pedro Hermosilla, Christian Stippel, Leon Sick