Scaling 4D Representations

João Carreira; Dilara Gokay; Michael King; Chuhan Zhang; Ignacio Rocco; Aravindh Mahendran; Thomas Albert Keck; Joseph Heyward; Skanda Koppula; Etienne Pot; Goker Erdogan; Yana Hasson; Yi Yang; Klaus Greff; Guillaume Le Moing; Sjoerd van Steenkiste; Daniel Zoran; Drew A. Hudson; Pedro Vélez; Luisa Polanía; Luke Friedman; Chris Duvarney; Ross Goroshin; Kelsey Allen; Jacob Walker; Rishabh Kabra; Eric Aboussouan; Jennifer Sun; Thomas Kipf; Carl Doersch; Viorica Pătrăucean; Dima Damen; Pauline Luc; Mehdi S. M. Sajjadi; Andrew Zisserman

Computer Vision and Pattern Recognition 2025-07-10 v2 Artificial Intelligence Machine Learning

Abstract

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused its evaluations on semantic tasks such as action classification and ImageNet classification. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that, by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales: performance on these 4D tasks improves consistently as model size increases from 20M parameters all the way to 22B parameters, by far the largest self-supervised video model reported to date. A rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling 4D representations. Pretrained models are available at https://github.com/google-deepmind/representations4d.

Keywords

masked autoencoder, self-supervised learning, representation learning

Cite

@article{arxiv.2412.15212,
  title  = {Scaling 4D Representations},
  author = {João Carreira and Dilara Gokay and Michael King and Chuhan Zhang and Ignacio Rocco and Aravindh Mahendran and Thomas Albert Keck and Joseph Heyward and Skanda Koppula and Etienne Pot and Goker Erdogan and Yana Hasson and Yi Yang and Klaus Greff and Guillaume Le Moing and Sjoerd van Steenkiste and Daniel Zoran and Drew A. Hudson and Pedro Vélez and Luisa Polanía and Luke Friedman and Chris Duvarney and Ross Goroshin and Kelsey Allen and Jacob Walker and Rishabh Kabra and Eric Aboussouan and Jennifer Sun and Thomas Kipf and Carl Doersch and Viorica Pătrăucean and Dima Damen and Pauline Luc and Mehdi S. M. Sajjadi and Andrew Zisserman},
  journal= {arXiv preprint arXiv:2412.15212},
  year   = {2025}
}
