Related papers: LatentMan: Generating Consistent Animated Characte…

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

Large-scale text-to-video (T2V) diffusion models have made great progress in recent years in terms of visual quality, motion and temporal consistency. However, the generation process is still a black box, where all attributes (e.g., appearance,…

Computer Vision and Pattern Recognition · Computer Science 2023-12-08 Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang

Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators

Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without…

Computer Vision and Pattern Recognition · Computer Science 2023-03-24 Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

LatentWarp: Consistent Diffusion Latents for Zero-Shot Video-to-Video Translation

Leveraging the generative ability of image diffusion models offers great potential for zero-shot video-to-video translation. The key lies in how to maintain temporal consistency across generated video frames by image diffusion models.…

Computer Vision and Pattern Recognition · Computer Science 2023-11-02 Yuxiang Bao, Di Qiu, Guoliang Kang, Baochang Zhang, Bo Jin, Kaiye Wang, Pengfei Yan

LatentMove: Towards Complex Human Movement Video Generation

Image-to-video (I2V) generation seeks to produce realistic motion sequences from a single reference image. Although recent methods exhibit strong temporal consistency, they often struggle when dealing with complex, non-repetitive human…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Ashkan Taghipour, Morteza Ghahremani, Mohammed Bennamoun, Farid Boussaid, Aref Miri Rekavandi, Zinuo Li, Qiuhong Ke, Hamid Laga

DirecT2V: Large Language Models are Frame-Level Directors for Zero-Shot Text-to-Video Generation

In the paradigm of AI-generated content (AIGC), there has been increasing attention to transferring knowledge from pre-trained text-to-image (T2I) models to text-to-video (T2V) generation. Despite their effectiveness, these frameworks face…

Computer Vision and Pattern Recognition · Computer Science 2024-02-07 Susung Hong, Junyoung Seo, Heeseong Shin, Sunghwan Hong, Seungryong Kim

Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation

We propose Latent-Shift -- an efficient text-to-video generation method based on a pretrained text-to-image generation model that consists of an autoencoder and a U-Net diffusion model. Learning a video diffusion model in the latent space…

Computer Vision and Pattern Recognition · Computer Science 2023-04-19 Jie An, Songyang Zhang, Harry Yang, Sonal Gupta, Jia-Bin Huang, Jiebo Luo, Xi Yin

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Character animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Li Hu, Xin Gao, Peng Zhang, Ke Sun, Bang Zhang, Liefeng Bo

HARIVO: Harnessing Text-to-Image Models for Video Generation

We present a method to create diffusion-based video models from pretrained Text-to-Image (T2I) models. Recently, AnimateDiff proposed freezing the T2I model while only training temporal layers. We advance this method by proposing a unique…

Computer Vision and Pattern Recognition · Computer Science 2024-10-11 Mingi Kwon, Seoung Wug Oh, Yang Zhou, Difan Liu, Joon-Young Lee, Haoran Cai, Baqiao Liu, Feng Liu, Youngjung Uh

Dancing Avatar: Pose and Text-Guided Human Motion Videos Synthesis with Image Diffusion Model

The rising demand for creating lifelike avatars in the digital realm has led to an increased need for generating high-quality human videos guided by textual descriptions and poses. We propose Dancing Avatar, designed to fabricate human…

Computer Vision and Pattern Recognition · Computer Science 2023-08-16 Bosheng Qin, Wentao Ye, Qifan Yu, Siliang Tang, Yueting Zhuang

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Text-to-video diffusion models enable the generation of high-quality videos that follow text instructions, making it easy to create diverse and individual content. However, existing approaches mostly focus on high-quality short video…

Computer Vision and Pattern Recognition · Computer Science 2025-04-17 Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, Humphrey Shi

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models

Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution…

Computer Vision and Pattern Recognition · Computer Science 2023-12-29 Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, Karsten Kreis

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work,…

Computer Vision and Pattern Recognition · Computer Science 2023-03-20 Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou

Training-Free Sketch-Guided Diffusion with Latent Optimization

Building on recent advances in diffusion models, text-to-image (T2I) generation models have demonstrated their ability to generate diverse and high-quality images. However, leveraging their potential for real-world content creation,…

Computer Vision and Pattern Recognition · Computer Science 2025-05-08 Sandra Zhang Ding, Jiafeng Mao, Kiyoharu Aizawa

TokenFlow: Consistent Diffusion Features for Consistent Video Editing

The generative AI revolution has recently expanded to videos. Nevertheless, current state-of-the-art video models are still lagging behind image models in terms of visual quality and user control over the generated content. In this work, we…

Computer Vision and Pattern Recognition · Computer Science 2023-11-21 Michal Geyer, Omer Bar-Tal, Shai Bagon, Tali Dekel

Video Text Preservation with Synthetic Text-Rich Videos

While Text-To-Video (T2V) models have advanced rapidly, they continue to struggle with generating legible and coherent text within videos. In particular, existing models often fail to render correctly even short phrases or words and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Ziyang Liu, Kevin Valencia, Justin Cui

Noise Crystallization and Liquid Noise: Zero-shot Video Generation using Image Diffusion Models

Although diffusion models are powerful for image generation, consistent and controllable video remains a longstanding problem for them. Video models require extensive training and computational resources, leading to high costs and large environmental…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Muhammad Haaris Khan, Hadrien Reynaud, Bernhard Kainz

Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling

Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions that correspond to a single sentence describing a single action. However, when a text stream describes a…

Computer Vision and Pattern Recognition · Computer Science 2023-08-04 Zhao Yang, Bing Su, Ji-Rong Wen

Latent Video Diffusion Models for High-Fidelity Long Video Generation

AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length…

Computer Vision and Pattern Recognition · Computer Science 2023-03-21 Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

T2Bs: Text-to-Character Blendshapes via Video Generation

We present T2Bs, a framework for generating high-quality, animatable character head morphable models from text by combining static text-to-3D generation with video diffusion. Text-to-3D models produce detailed static geometry but lack…

Graphics · Computer Science 2025-09-30 Jiahao Luo, Chaoyang Wang, Michael Vasilkovsky, Vladislav Shakhrai, Di Liu, Peiye Zhuang, Sergey Tulyakov, Peter Wonka, Hsin-Ying Lee, James Davis, Jian Wang

TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require…

Computer Vision and Pattern Recognition · Computer Science 2024-04-26 Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks