Related papers: SimDA: Simple Diffusion Adapter for Efficient Vide…

200 papers

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work,…

Computer Vision and Pattern Recognition · Computer Science 2023-03-20 Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Diffusion models (DMs) have recently achieved impressive photorealism in image and video generation. However, their application to image animation remains limited, even when trained on large-scale datasets. Two primary challenges contribute…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Zhenhao Li, Shaohan Yi, Zheng Liu, Leonartinus Gao, Minh Ngoc Le, Ambrose Ling, Zhuoran Wang, Md Amirul Islam, Zhixiang Chi, Yuanhao Yu

Recent advancements in text-to-image (T2I) generation using diffusion models have enabled cost-effective video-editing applications by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Yangfan He, Sida Li, Jianhui Wang, Kun Li, Xinyuan Song, Xinhang Yuan, Keqin Li, Kuan Lu, Menghao Huo, Jingqun Tang, Yi Xin, Jiaqi Chen, Miao Zhang, Xueqian Wang

While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render correctly even short phrases or words and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Ziyang Liu, Kevin Valencia, Justin Cui

The most advanced text-to-image (T2I) models require significant training costs (e.g., millions of GPU hours), seriously hindering fundamental innovation in the AIGC community while increasing CO2 emissions. This paper introduces…

Computer Vision and Pattern Recognition · Computer Science 2024-01-01 Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM)…

Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video…

Computer Vision and Pattern Recognition · Computer Science 2025-04-17 Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient…

Computer Vision and Pattern Recognition · Computer Science 2025-10-29 Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang

Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V)…

Computer Vision and Pattern Recognition · Computer Science 2024-06-28 Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Pengfei Wan, Di Zhang, Yufan Liu, Weiming Hu, Zhengjun Zha, Haibin Huang, Chongyang Ma

Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-08-12 Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang

Text-to-Image (T2I) diffusion models have achieved remarkable success in synthesizing high-quality images conditioned on text prompts. Recent methods have tried to replicate the success by either training text-to-video (T2V) models on a…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Nazmul Karim, Umar Khalid, Mohsen Joneidi, Chen Chen, Nazanin Rahnavard

Diffusion Transformers (DiTs) have recently improved video generation quality. However, their heavy computational cost makes real-time or on-device generation infeasible. In this work, we introduce S2DiT, a Streaming Sandwich Diffusion…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Lin Zhao, Yushu Wu, Aleksei Lebedev, Dishani Lahiri, Meng Dong, Arpit Sahni, Michael Vasilkovsky, Hao Chen, Ju Hu, Aliaksandr Siarohin, Sergey Tulyakov, Yanzhi Wang, Anil Kag, Yanyu Li

Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Yixuan Ren, Yang Zhou, Jimei Yang, Jing Shi, Difan Liu, Feng Liu, Mingi Kwon, Abhinav Shrivastava

Text-to-video (T2V) generation technology holds the potential to transform multiple domains such as education, marketing, entertainment, and assistive technologies for individuals with visual or reading comprehension challenges, by creating…

Graphics · Computer Science 2025-10-07 Nilay Kumar, Priyansh Bhandari, G. Maragatham

Sora unveils the potential of scaling Diffusion Transformers for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details. In this technical…

We present Stable Video Diffusion, a latent video diffusion model for high-resolution, state-of-the-art text-to-video and image-to-video generation. Recently, latent diffusion models trained for 2D image synthesis have been turned into…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, Robin Rombach

As text-to-image (T2I) synthesis models increase in size, they demand higher inference costs due to the need for more expensive GPUs with larger memory, which makes it challenging to reproduce these models in addition to the restricted…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Youngwan Lee, Kwanyong Park, Yoorhim Cho, Yong-Ju Lee, Sung Ju Hwang

Large-scale diffusion models have achieved state-of-the-art results on text-to-image synthesis (T2I) tasks. Despite their ability to generate high-quality yet creative images, we observe that attribute-binding and compositional…

Computer Vision and Pattern Recognition · Computer Science 2023-03-02 Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, William Yang Wang

With the advance of text-to-image (T2I) diffusion models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable…

Computer Vision and Pattern Recognition · Computer Science 2024-02-09 Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai

As artificial intelligence-generated content (AIGC) continues to evolve, video-to-audio (V2A) generation has emerged as a key area with promising applications in multimedia editing, augmented reality, and automated content creation. While…

Computer Vision and Pattern Recognition · Computer Science 2025-03-17 Yuhuan You, Xihong Wu, Tianshu Qu