InstanceV: Instance-Level Video Generation

Yuheng Chen; Teng Hu; Jiangning Zhang; Zhucun Xue; Ran Yi; Lizhuang Ma

Computer Vision and Pattern Recognition 2025-12-01 v1

Abstract

Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions and therefore lack fine-grained controllability over video generation. To address this challenge, we propose InstanceV, a video generation framework that enables i) instance-level control and ii) global semantic consistency. Specifically, with the aid of the proposed Instance-aware Masked Cross-Attention mechanism, InstanceV maximizes the utilization of additional instance-level grounding information to generate correctly attributed instances at designated spatial locations. To improve overall consistency, we introduce the Shared Timestep-Adaptive Prompt Enhancement module, which connects local instances with global semantics in a parameter-efficient manner. Furthermore, we incorporate Spatially-Aware Unconditional Guidance during both training and inference to alleviate the disappearance of small instances. Finally, we propose a new benchmark, named InstanceBench, which combines general video quality metrics with instance-aware metrics for a more comprehensive evaluation of instance-level video generation. Extensive experiments demonstrate that InstanceV not only achieves remarkable instance-level controllability in video generation, but also outperforms existing state-of-the-art models in both general quality and instance-aware metrics across qualitative and quantitative evaluations.

Keywords

video generation; video understanding; video retrieval

Cite

@article{arxiv.2511.23146,
  title  = {InstanceV: Instance-Level Video Generation},
  author = {Yuheng Chen and Teng Hu and Jiangning Zhang and Zhucun Xue and Ran Yi and Lizhuang Ma},
  journal= {arXiv preprint arXiv:2511.23146},
  year   = {2025}
}
