Related papers: Action-conditioned video data improves predictabil…

Action-conditioned Benchmarking of Robotic Video Prediction Models: a Comparative Study

A defining characteristic of intelligent systems is the ability to make action decisions based on the anticipated outcomes. Video prediction systems have been demonstrated as a solution for predicting how the future will unfold visually,…

Computer Vision and Pattern Recognition · Computer Science 2019-10-08 Manuel Serra Nunes, Atabak Dehban, Plinio Moreno, José Santos-Victor

Video Prediction with Appearance and Motion Conditions

Video prediction aims to generate realistic future frames by learning dynamic visual patterns. One fundamental challenge is to deal with future uncertainty: How should a model behave when there are multiple correct, equally probable futures?…

Computer Vision and Pattern Recognition · Computer Science 2018-07-10 Yunseok Jang, Gunhee Kim, Yale Song

TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator

Advances in technology have led to the development of methods that can create desired visual multimedia. In particular, image generation using deep learning has been extensively studied across diverse fields. In comparison, video…

Computer Vision and Pattern Recognition · Computer Science 2021-06-29 Doyeon Kim, Donggyu Joo, Junmo Kim

Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

Robotic manipulation requires anticipating how the environment evolves in response to actions, yet most existing systems lack this predictive capability, often resulting in errors and inefficiency. While Vision-Language Models (VLMs)…

Robotics · Computer Science 2026-02-12 Songen Gu, Yunuo Cai, Tianyu Wang, Simo Wu, Yanwei Fu

Motion Generation: A Survey of Generative Approaches and Benchmarks

Motion generation, the task of synthesizing realistic motion sequences from various conditioning inputs, has become a central problem in computer vision, computer graphics, and robotics, with applications ranging from animation and virtual…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Aliasghar Khani, Arianna Rampini, Bruno Roy, Larasika Nadela, Noa Kaplan, Evan Atherton, Derek Cheung, Jacky Bibliowicz

CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers

We extend multimodal transformers to include 3D camera motion as a conditioning signal for the task of video generation. Generative video models are becoming increasingly powerful, thus focusing research efforts on methods of controlling…

Computer Vision and Pattern Recognition · Computer Science 2024-05-24 Andrew Marmon, Grant Schindler, José Lezama, Dan Kondratyuk, Bryan Seybold, Irfan Essa

AMG: Avatar Motion Guided Video Generation

The human video generation task has gained significant attention with the advancement of deep generative models. Generating realistic videos with human movements is challenging in nature, due to the intricacies of human body topology and…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Zhangsihao Yang, Mengyi Shan, Mohammad Farazi, Wenhui Zhu, Yanxi Chen, Xuanzhao Dong, Yalin Wang

Generative Inbetweening through Frame-wise Conditions-Driven Video Generation

Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. Although remarkable progress has been made in video generation models, generative inbetweening still faces challenges in maintaining…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Tianyi Zhu, Dongwei Ren, Qilong Wang, Xiaohe Wu, Wangmeng Zuo

Video Generators are Robot Policies

Despite tremendous progress in dexterous manipulation, current visuomotor policies remain fundamentally limited by two challenges: they struggle to generalize under perceptual or behavioral distribution shifts, and their performance is…

Robotics · Computer Science 2025-08-04 Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, Carl Vondrick

Unified Video Action Model

A unified video and action model holds significant promise for robotics, where videos provide rich scene information for action prediction, and actions provide dynamics information for video prediction. However, effectively combining video…

Robotics · Computer Science 2025-04-28 Shuang Li, Yihuai Gao, Dorsa Sadigh, Shuran Song

Conditional Video Generation Using Action-Appearance Captions

The field of automatic video generation has received a boost thanks to the recent Generative Adversarial Networks (GANs). However, most existing methods cannot control the contents of the generated video using a text caption, losing their…

Computer Vision and Pattern Recognition · Computer Science 2018-12-06 Shohei Yamamoto, Antonio Tejero-de-Pablos, Yoshitaka Ushiku, Tatsuya Harada

Attentive Semantic Video Generation using Captions

This paper proposes a network architecture to perform variable length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term…

Computer Vision and Pattern Recognition · Computer Science 2017-11-17 Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian

EnerVerse-AC: Envisioning Embodied Environments with Action Condition

Robotic imitation learning has advanced from solving static tasks to addressing dynamic interaction scenarios, but testing and evaluation remain costly and challenging due to the need for real-time interaction with dynamic environments. We…

Robotics · Computer Science 2025-05-16 Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, Guanghui Ren

Playable Video Generation

This paper introduces the unsupervised learning problem of playable video generation (PVG). In PVG, we aim at allowing a user to control the generated video by selecting a discrete action at every time step as when playing a video game. The…

Computer Vision and Pattern Recognition · Computer Science 2021-01-29 Willi Menapace, Stéphane Lathuilière, Sergey Tulyakov, Aliaksandr Siarohin, Elisa Ricci

Make-A-Story: Visual Memory Conditioned Consistent Story Generation

There has been a recent explosion of impressive generative models that can produce high quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous…

Computer Vision and Pattern Recognition · Computer Science 2023-05-09 Tanzila Rahman, Hsin-Ying Lee, Jian Ren, Sergey Tulyakov, Shweta Mahajan, Leonid Sigal

Collaboratively Self-supervised Video Representation Learning for Action Recognition

Considering the close connection between action recognition and human pose estimation, we design a Collaboratively Self-supervised Video Representation (CSVR) learning framework specific to action recognition by jointly factoring in…

Computer Vision and Pattern Recognition · Computer Science 2025-02-03 Jie Zhang, Zhifan Wan, Lanqing Hu, Stephen Lin, Shuzhe Wu, Shiguang Shan

VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

Recent advances in robot foundation models trained on large-scale human teleoperation data have enabled robots to perform increasingly complex real-world tasks. However, scaling these systems remains difficult because collecting…

Robotics · Computer Science 2026-04-13 Xiaolei Lang, Yang Wang, Yukun Zhou, Chaojun Ni, Kerui Li, Jiagang Zhu, Tianze Liu, Jiajun Lv, Xingxing Zuo, Yun Ye, Guan Huang, Xiaofeng Wang, Zheng Zhu

Generating Human Motion Videos using a Cascaded Text-to-Video Framework

Human video generation is becoming an increasingly important task with broad applications in graphics, entertainment, and embodied AI. Despite the rapid progress of video diffusion models (VDMs), their use for general-purpose human video…

Computer Vision and Pattern Recognition · Computer Science 2025-10-07 Hyelin Nam, Hyojun Go, Byeongjun Park, Byung-Hoon Kim, Hyungjin Chung

Lets Play Music: Audio-driven Performance Video Generation

We propose a new task named Audio-driven Performance Video Generation (APVG), which aims to synthesize the video of a person playing a certain instrument guided by a given music audio clip. It is a challenging task to generate the…

Computer Vision and Pattern Recognition · Computer Science 2020-11-06 Hao Zhu, Yi Li, Feixia Zhu, Aihua Zheng, Ran He

Generating Long Videos of Dynamic Scenes

We present a video generation model that accurately reproduces object motion, changes in camera viewpoint, and new content that arises over time. Existing video generation methods often fail to produce new content as a function of time…

Computer Vision and Pattern Recognition · Computer Science 2022-06-10 Tim Brooks, Janne Hellsten, Miika Aittala, Ting-Chun Wang, Timo Aila, Jaakko Lehtinen, Ming-Yu Liu, Alexei A. Efros, Tero Karras