Related papers: Patch-based Object-centric Transformers for Effici…

Generative Video Transformer: Can Objects be the Words?

Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high…

Machine Learning · Computer Science 2021-07-21 Yi-Fu Wu, Jaesik Yoon, Sungjin Ahn

Object-Region Video Transformers

Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In…

Computer Vision and Pattern Recognition · Computer Science 2022-06-13 Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Object-Centric Video Prediction via Decoupling of Object Dynamics and Interactions

We propose a novel framework for the task of object-centric video prediction, i.e., extracting the compositional structure of a video sequence, as well as modeling object dynamics and interactions from visual observations in order to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Angel Villar-Corrales, Ismail Wahdan, Sven Behnke

Anticipative Video Transformer

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next…

Computer Vision and Pattern Recognition · Computer Science 2021-09-23 Rohit Girdhar, Kristen Grauman

Deformable Video Transformer

Video transformers have recently emerged as an effective alternative to convolutional networks for action classification. However, most prior video transformers adopt either global space-time attention or hand-defined strategies to compare…

Computer Vision and Pattern Recognition · Computer Science 2022-04-01 Jue Wang, Lorenzo Torresani

OVC-Net: Object-Oriented Video Captioning with Temporal Graph and Detail Enhancement

Traditional video captioning requests a holistic description of the video, yet the detailed descriptions of the specific objects may not be available. Without associating the moving trajectories, these image-based data-driven methods cannot…

Computer Vision and Pattern Recognition · Computer Science 2020-07-15 Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guang Chen, Jun Guo

Is an Object-Centric Video Representation Beneficial for Transfer?

The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Chuhan Zhang, Ankush Gupta, Andrew Zisserman

Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, it is essential to develop highly accurate predictions for object motions. This…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Ge Ya Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

Patch-level Sounding Object Tracking for Audio-Visual Question Answering

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this…

Multimedia · Computer Science 2024-12-17 Zhangbin Li, Jinxing Zhou, Jing Zhang, Shengeng Tang, Kun Li, Dan Guo

Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top…

Computer Vision and Pattern Recognition · Computer Science 2024-07-10 Rui Qian, Shuangrui Ding, Dahua Lin

Compositional Video Synthesis by Temporal Object-Centric Learning

We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Adil Kaan Akan, Yucel Yemez

Video based Object 6D Pose Estimation using Transformers

We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Apoorva Beedu, Huda Alamri, Irfan Essa

Object-Centric Multiple Object Tracking

Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT)…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Zixu Zhao, Jiaze Wang, Max Horn, Yizhuo Ding, Tong He, Zechen Bai, Dominik Zietlow, Carl-Johann Simon-Gabriel, Bing Shuai, Zhuowen Tu, Thomas Brox, Bernt Schiele, Yanwei Fu, Francesco Locatello, Zheng Zhang, Tianjun Xiao

Object-centric Video Representation for Long-term Action Anticipation

This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the…

Computer Vision and Pattern Recognition · Computer Science 2023-11-02 Ce Zhang, Changcheng Fu, Shijie Wang, Nakul Agarwal, Kwonjoon Lee, Chiho Choi, Chen Sun

OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog

We present the Object Language Video Transformer (OLViT) - a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and…

Computer Vision and Pattern Recognition · Computer Science 2024-02-21 Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling

Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning

Video captioning aims to automatically generate natural language descriptions of video content, which has drawn a lot of attention in recent years. Generating accurate and fine-grained captions needs to not only understand the global content…

Computer Vision and Pattern Recognition · Computer Science 2019-06-12 Junchao Zhang, Yuxin Peng

Attention-guided Temporally Coherent Video Object Matting

This paper proposes a novel deep learning-based video object matting method that can achieve temporally coherent matting results. Its key component is an attention-based temporal aggregation module that maximizes image matting networks'…

Computer Vision and Pattern Recognition · Computer Science 2021-07-30 Yunke Zhang, Chi Wang, Miaomiao Cui, Peiran Ren, Xuansong Xie, Xian-sheng Hua, Hujun Bao, Qixing Huang, Weiwei Xu

Accelerating Vision Transformers with Adaptive Patch Sizes

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses…

Computer Vision and Pattern Recognition · Computer Science 2026-04-24 Rohan Choudhury, JungEun Kim, Jinhyung Park, Eunho Yang, László A. Jeni, Kris M. Kitani

Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space…

Computer Vision and Pattern Recognition · Computer Science 2023-04-19 Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin

Video Prediction by Efficient Transformers

Video prediction is a challenging computer vision task that has a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. Firstly, an efficient local spatial-temporal separation…

Computer Vision and Pattern Recognition · Computer Science 2022-12-13 Xi Ye, Guillaume-Alexandre Bilodeau