Video Occupancy Models

Manan Tomar; Philippe Hansen-Estruch; Philip Bachman; Alex Lamb; John Langford; Matthew E. Taylor; Sergey Levine

Computer Vision and Pattern Recognition | 2024-07-16 | v1 | Artificial Intelligence

Abstract

We introduce a new family of video prediction models designed to support downstream control tasks. We call these models Video Occupancy models (VOCs). VOCs operate in a compact latent space, thus avoiding the need to make predictions about individual pixels. Unlike prior latent-space world models, VOCs directly predict the discounted distribution of future states in a single step, thus avoiding the need for multistep roll-outs. We show that both properties are beneficial when building predictive models of video for use in downstream control. Code is available at https://github.com/manantomar/video-occupancy-models.
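
To make "discounted distribution of future states" concrete, the sketch below uses the standard definition of a discounted occupancy, in which the future state at offset k is weighted by (1 - gamma) * gamma^(k-1). Sampling a target from that distribution amounts to drawing k from a geometric distribution with success probability 1 - gamma. This is an illustration of the general idea, not the authors' implementation; the function name `sample_future_index`, its parameters, and the clamping at the end of the trajectory are assumptions made for the example.

```python
import random


def sample_future_index(t, gamma, num_frames, rng=random):
    """Sample a target frame index from a discounted future distribution.

    Hypothetical helper: the offset k >= 1 is drawn with probability
    (1 - gamma) * gamma**(k - 1), i.e. k ~ Geometric(1 - gamma).
    Clamping to the last frame is an assumption made for this sketch.
    """
    k = 1
    # Keep stepping forward with probability gamma, stop with probability 1 - gamma.
    while rng.random() < gamma:
        k += 1
    return min(t + k, num_frames - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    # For frame 0 of a 100-frame clip, draw a few discounted future targets.
    print([sample_future_index(0, 0.9, 100, rng) for _ in range(5)])
```

A model trained to match targets drawn this way predicts the discounted future in one step, whereas a one-step world model would have to be rolled out k times to reach the same frame.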

Keywords

occupancy prediction, video generation, video understanding

Cite

@article{arxiv.2407.09533,
  title  = {Video Occupancy Models},
  author = {Manan Tomar and Philippe Hansen-Estruch and Philip Bachman and Alex Lamb and John Langford and Matthew E. Taylor and Sergey Levine},
  journal= {arXiv preprint arXiv:2407.09533},
  year   = {2024}
}
