Related papers: SimDA: Simple Diffusion Adapter for Efficient Vide…

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work,…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-20 · Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Few-Shot-Based Modular Image-to-Video Adapter for Diffusion Models

Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-01 · Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le, Ambrose Ling, Zhuoran Wang, Md Amirul Islam, Zhixiang Chi, Yuanhao Yu

Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion

Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-12 · Yangfan He, Sida Li, Jianhui Wang, Kun Li, Xinyuan Song, Xinhang Yuan, Keqin Li, Kuan Lu, Menghao Huo, Jingqun Tang, Yi Xin, Jiaqi Chen, Miao Zhang, Xueqian Wang

Video Text Preservation with Synthetic Text-Rich Videos

While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render correctly even short phrases or words and…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-11 · Ziyang Liu, Kevin Valencia, Justin Cui

PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering the fundamental innovation for the AIGC community while increasing CO2 emissions. This paper introduces…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-01 · Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM)…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-04 · Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-17 · Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

VSA: Faster Video Diffusion with Trainable Sparse Attention

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-29 · Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V)…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-28 · Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, Chongyang Ma

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However,…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang

SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-driven Video Editing

Text-to-Image (T2I) diffusion models have achieved remarkable success in synthesizing high-quality images conditioned on text prompts. Recent methods have tried to replicate the success by either training text-to-video (T2V) models on a…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-04 · Nazmul Karim, Umar Khalid, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard

S2DiT: Sandwich Diffusion Transformer for Mobile Streaming Video Generation

Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-10 · Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, Sergey Tulyakov, Yanzhi Wang, Anil Kag, Yanyu Li

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-29 · Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

Bridging Text and Video Generation: A Survey

Text-to-video (T2V) generation technology holds potential to transform multiple domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating…

Graphics · Computer Science · 2025-10-07 · Nilay Kumar, Priyansh Bhandari, G. Maragatham

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-14 · Peng Gao, Le Zhuo, Dongyang Liu, Ruoyi Du, Xu Luo, Longtian Qiu, Yuhang Zhang, Chen Lin, Rongjie Huang, Shijie Geng, Renrui Zhang, Junlin Xi, Wenqi Shao, Zhengkai Jiang, Tianshuo Yang, Weicai Ye, He Tong, Jingwen He, Yu Qiao, Hongsheng Li

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

We present Stable Video Diffusion - a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-28 · Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

KOALA: Empirical Lessons Toward Memory-Efficient and Fast Diffusion Models for Text-to-Image Synthesis

As text-to-image (T2I) synthesis models increase in size, they demand higher inference costs due to the need for more expensive GPUs with larger memory, which makes it challenging to reproduce these models in addition to the restricted…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-26 · Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, Sung Ju Hwang

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribution-binding and compositional…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-02 · Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-09 · Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai

TA-V2A: Textually Assisted Video-to-Audio Generation

As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-17 · Yuhuan You, Xihong Wu, Tianshu Qu