Related papers: Compositional Video Generation as Flow Equalizatio…

VideoTetris: Towards Compositional Text-to-Video Generation

Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods may face challenges when handling complex (long) video generation scenarios that involve multiple objects or dynamic changes in…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-15 · Ye Tian, Ling Yang, Haotian Yang, Yuan Gao, Yufan Deng, Jingmin Chen, Xintao Wang, Zhaochen Yu, Xin Tao, Pengfei Wan, Di Zhang, Bin Cui

ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation

Personalized text-to-image generation using diffusion models has recently emerged and garnered significant interest. This task learns a novel concept (e.g., a unique toy), illustrated in a handful of images, into a generative model that…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-08 · Shaozhe Hao, Kai Han, Shihao Zhao, Kwan-Yee K. Wong

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation

Text-to-video (T2V) generative models have advanced significantly, yet their ability to compose different objects, attributes, actions, and motions into a video remains unexplored. Previous text-to-video benchmarks also neglect this…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-16 · Kaiyue Sun, Kaiyi Huang, Xian Liu, Yue Wu, Zihan Xu, Zhenguo Li, Xihui Liu

TIV-Diffusion: Towards Object-Centric Movement for Text-driven Image to Video Generation

Text-driven Image to Video Generation (TI2V) aims to generate controllable video given the first frame and corresponding textual description. The primary challenges of this task lie in two parts: (i) how to identify the target objects and…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-17 · Xingrui Wang, Xin Li, Yaosi Hu, Hanxin Zhu, Chen Hou, Cuiling Lan, Zhibo Chen

TC-Bench: Benchmarking Temporal Compositionality in Text-to-Video and Image-to-Video Generation

Video generation has many unique challenges beyond those of image generation. The temporal dimension introduces extensive possible variations across frames, over which consistency and continuity may be violated. In this study, we move…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-14 · Weixi Feng, Jiachen Li, Michael Saxon, Tsu-jui Fu, Wenhu Chen, William Yang Wang

VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models

Text-to-video diffusion models have advanced video generation significantly. However, customizing these models to generate videos with tailored motions presents a substantial challenge. Specifically, they encounter hurdles in (a) accurately…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-05 · Hyeonho Jeong, Geon Yeong Park, Jong Chul Ye

TokensGen: Harnessing Condensed Tokens for Long Video Generation

Generating consistent long videos is a complex challenge: while diffusion-based generative models generate visually impressive short clips, extending them to longer durations often leads to memory bottlenecks and long-term inconsistency. In…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-22 · Wenqi Ouyang, Zeqi Xiao, Danni Yang, Yifan Zhou, Shuai Yang, Lei Yang, Jianlou Si, Xingang Pan

UniVid: Pyramid Diffusion Model for High Quality Video Generation

Diffusion-based text-to-video (T2V) and image-to-video (I2V) generation have emerged as a prominent research focus. However, there exists a challenge in integrating the two generative paradigms into a unified model. In this paper,…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-17 · Xinyu Xiao, Binbin Yang, Tingtian Li, Yipeng Yu, Sen Lei

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-21 · Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

VSC: Visual Search Compositional Text-to-Image Diffusion Model

Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-05 · Do Huu Dat, Nam Hyeonu, Po-Yuan Mao, Tae-Hyun Oh

MVOC: a training-free multiple video object composition method with diffusion models

Video composition is the core task of video editing. Although image composition based on diffusion models has been highly successful, it is not straightforward to extend the achievement to video object composition tasks, which not only…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-25 · Wei Wang, Yaosen Chen, Yuegen Liu, Qi Yuan, Shubin Yang, Yanru Zhang

Bridging Text and Video Generation: A Survey

Text-to-video (T2V) generation technology holds potential to transform multiple domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating…

Graphics · Computer Science · 2025-10-07 · Nilay Kumar, Priyansh Bhandari, G. Maragatham

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-17 · Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Image-to-Video Diffusion: From Foundations to Open Frontiers

Diffusion-based image-to-video (I2V) generation has become a central direction in generative models by turning a reference image, with optional conditions, into a temporally coherent video. Compared with broader video generation…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-19 · Xianlong Wang, Wenbo Pan, Shijia Zhou, Ke Li, Yuqi Wang, Zeyu Ye, Hangtao Zhang, Leo Yu Zhang, Xiaohua Jia

Compositional Video Generation via Inference-Time Guidance

Text-to-video diffusion models generate realistic videos, but often fail on prompts requiring fine-grained compositional understanding, such as relations between entities, attributes, actions, and motion directions. We hypothesize that…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-15 · Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-28 · Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Video generation has increasingly gained interest in both academia and industry. Although commercial tools can generate plausible videos, there is a limited number of open-source models available for researchers and engineers. In this work,…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-31 · Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan

Anchored Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models

State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they still frequently fail to compose complex scenes or follow logical temporal instructions. In this paper, we argue that many errors,…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-26 · Mariam Hassan, Bastien Van Delft, Wuyang Li, Alexandre Alahi

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-18 · Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

Generative Disco: Text-to-Video Generation for Music Visualization

Visuals can enhance our experience of music, owing to the way they can amplify the emotions and messages conveyed within it. However, creating music visualization is a complex, time-consuming, and resource-intensive process. We introduce…

Human-Computer Interaction · Computer Science · 2023-09-29 · Vivian Liu, Tao Long, Nathan Raw, Lydia Chilton