Related papers: Patch-based Object-centric Transformers for Effici…

200 papers

Transformers have been successful for many natural language processing tasks. However, applying transformers to the video domain for tasks such as long-term video generation and scene understanding has remained elusive due to the high…

Machine Learning · Computer Science 2021-07-21 Yi-Fu Wu , Jaesik Yoon , Sungjin Ahn

Recently, video transformers have shown great success in video understanding, exceeding CNN performance; yet existing video transformer models do not explicitly model objects, although objects can be essential for recognizing actions. In…

Computer Vision and Pattern Recognition · Computer Science 2022-06-13 Roei Herzig , Elad Ben-Avraham , Karttikeya Mangalam , Amir Bar , Gal Chechik , Anna Rohrbach , Trevor Darrell , Amir Globerson

We propose a novel framework for the task of object-centric video prediction, i.e., extracting the compositional structure of a video sequence, as well as modeling object dynamics and interactions from visual observations in order to…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Angel Villar-Corrales , Ismail Wahdan , Sven Behnke

We propose Anticipative Video Transformer (AVT), an end-to-end attention-based video modeling architecture that attends to the previously observed video in order to anticipate future actions. We train the model jointly to predict the next…

Computer Vision and Pattern Recognition · Computer Science 2021-09-23 Rohit Girdhar , Kristen Grauman

Video transformers have recently emerged as an effective alternative to convolutional networks for action classification. However, most prior video transformers adopt either global space-time attention or hand-defined strategies to compare…

Computer Vision and Pattern Recognition · Computer Science 2022-04-01 Jue Wang , Lorenzo Torresani

Traditional video captioning requests a holistic description of the video, yet the detailed descriptions of the specific objects may not be available. Without associating the moving trajectories, these image-based data-driven methods cannot…

Computer Vision and Pattern Recognition · Computer Science 2020-07-15 Fangyi Zhu , Jenq-Neng Hwang , Zhanyu Ma , Guang Chen , Jun Guo

The objective of this work is to learn an object-centric video representation, with the aim of improving transferability to novel tasks, i.e., tasks different from the pre-training task of action classification. To this end, we introduce a…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Chuhan Zhang , Ankush Gupta , Andrew Zisserman

Controllable video generation has attracted significant attention, largely due to advances in video diffusion models. In domains such as autonomous driving, it is essential to develop highly accurate predictions for object motions. This…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Ge Ya Luo , Zhi Hao Luo , Anthony Gosselin , Alexia Jolicoeur-Martineau , Christopher Pal

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this…

Multimedia · Computer Science 2024-12-17 Zhangbin Li , Jinxing Zhou , Jing Zhang , Shengeng Tang , Kun Li , Dan Guo

Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top…

Computer Vision and Pattern Recognition · Computer Science 2024-07-10 Rui Qian , Shuangrui Ding , Dahua Lin

We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations, extending our previous work, SlotAdapt, from images to video. While existing object-centric approaches…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Adil Kaan Akan , Yucel Yemez

We introduce a Transformer based 6D Object Pose Estimation framework VideoPose, comprising an end-to-end attention based modelling architecture, that attends to previous frames in order to estimate accurate 6D Object Poses in videos. Our…

Computer Vision and Pattern Recognition · Computer Science 2023-09-06 Apoorva Beedu , Huda Alamri , Irfan Essa

Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT)…

This paper focuses on building object-centric representations for long-term action anticipation in videos. Our key motivation is that objects provide important cues to recognize and predict human-object interactions, especially when the…

Computer Vision and Pattern Recognition · Computer Science 2023-11-02 Ce Zhang , Changcheng Fu , Shijie Wang , Nakul Agarwal , Kwonjoon Lee , Chiho Choi , Chen Sun

We present the Object Language Video Transformer (OLViT) - a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and…

Computer Vision and Pattern Recognition · Computer Science 2024-02-21 Adnen Abdessaied , Manuel von Hochmeister , Andreas Bulling

Video captioning aims to automatically generate natural language descriptions of video content and has drawn a lot of attention in recent years. Generating accurate and fine-grained captions requires understanding not only the global content…

Computer Vision and Pattern Recognition · Computer Science 2019-06-12 Junchao Zhang , Yuxin Peng

This paper proposes a novel deep learning-based video object matting method that can achieve temporally coherent matting results. Its key component is an attention-based temporal aggregation module that maximizes image matting networks'…

Computer Vision and Pattern Recognition · Computer Science 2021-07-30 Yunke Zhang , Chi Wang , Miaomiao Cui , Peiran Ren , Xuansong Xie , Xian-sheng Hua , Hujun Bao , Qixing Huang , Weiwei Xu

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses…

Computer Vision and Pattern Recognition · Computer Science 2026-04-24 Rohan Choudhury , JungEun Kim , Jinhyung Park , Eunho Yang , László A. Jeni , Kris M. Kitani

We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space…

Computer Vision and Pattern Recognition · Computer Science 2023-04-19 Jie An , Songyang Zhang , Harry Yang , Sonal Gupta , Jia-Bin Huang , Jiebo Luo , Xi Yin

Video prediction is a challenging computer vision task that has a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. Firstly, an efficient local spatial-temporal separation…

Computer Vision and Pattern Recognition · Computer Science 2022-12-13 Xi Ye , Guillaume-Alexandre Bilodeau