Related papers: InstanceCap: Improving Text-to-Video Generation vi…

InstanceV: Instance-Level Video Generation

Recent advances in text-to-video diffusion models have enabled the generation of high-quality videos conditioned on textual descriptions. However, most existing text-to-video models rely solely on textual conditions, lacking general…

Computer Vision and Pattern Recognition · Computer Science 2025-12-01 Yuheng Chen, Teng Hu, Jiangning Zhang, Zhucun Xue, Ran Yi, Lizhuang Ma

Enriching Video Captions With Contextual Text

Understanding video content and generating caption with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning…

Computer Vision and Pattern Recognition · Computer Science 2020-07-30 Philipp Rimle, Pelin Dogan, Markus Gross

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps…

Computer Vision and Pattern Recognition · Computer Science 2021-10-14 Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions

Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from…

Sound · Computer Science 2025-01-03 Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang

InstAP: Instance-Aware Vision-Language Pre-Train for Spatial-Temporal Understanding

Current vision-language pre-training (VLP) paradigms excel at global scene understanding but struggle with instance-level reasoning due to global-only supervision. We introduce InstAP, an Instance-Aware Pre-training framework that jointly…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Ashutosh Kumar, Rajat Saini, Jingjing Pan, Mustafa Erdogan, Mingfang Zhang, Betty Le Dem, Norimasa Kobori, Quan Kong

Implicit and Explicit Commonsense for Multi-sentence Video Captioning

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack…

Computer Vision and Pattern Recognition · Computer Science 2024-01-10 Shih-Han Chou, James J. Little, Leonid Sigal

CodecCap: High-Fidelity Codec-Inspired Residual Modeling for Dense Video Captioning

Existing video captioning methods struggle to balance visual fidelity and redundancy: holistic captions are compact but lose fine-grained evidence, whereas segment-wise captions improve coverage but introduce heavy redundancy. We propose…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Zihan Lin, Songhe Deng, Shuwei He, Danxiang Zhu, Dan Zhang, Yishu Lei, Xianlong Luo, Shikun Feng, Rui Liu

VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation

The training of controllable text-to-video (T2V) models relies heavily on the alignment between videos and captions, yet little existing research connects video caption evaluation with T2V generation assessment. This paper introduces…

Artificial Intelligence · Computer Science 2025-05-20 Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan

Cap2Sum: Learning to Summarize Videos by Generating Captions

With the rapid growth of video data on the internet, video summarization is becoming a very important AI technology. However, due to the high labelling cost of video summarization, existing studies have to be conducted on small-scale…

Multimedia · Computer Science 2026-01-13 Cairong Zhao, Chutian Wang, Zifan Song, Guosheng Hu, Haonan Chen, Xiaofan Zhai

Models See Hallucinations: Evaluating the Factuality in Video Captioning

Video captioning aims to describe events in a video with natural language. In recent years, many works have focused on improving captioning models' performance. However, like other text generation tasks, it risks introducing factual errors…

Computer Vision and Pattern Recognition · Computer Science 2023-03-07 Hui Liu, Xiaojun Wan

Progress-Aware Video Frame Captioning

While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman

VC4VG: Optimizing Video Captions for Text-to-Video Generation

Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin

Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention

Video instance segmentation aims at predicting object segmentation masks for each frame, as well as associating the instances across multiple frames. Recent end-to-end video instance segmentation methods are capable of performing object…

Computer Vision and Pattern Recognition · Computer Science 2022-06-15 Quanzeng You, Jiang Wang, Peng Chu, Andre Abrantes, Zicheng Liu

VIVECaption: A Split Approach to Caption Quality Improvement

Caption quality has emerged as a critical bottleneck in training high-quality text-to-image (T2I) and text-to-video (T2V) generative models. While visual language models (VLMs) are commonly deployed to generate captions from visual data,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Varun Ananth, Baqiao Liu, Haoran Cai

Sequence to Sequence -- Video to Text

Real-world videos often have complex dynamics; and methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length.…

Computer Vision and Pattern Recognition · Computer Science 2015-10-20 Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling

VAR is a new generation paradigm that employs 'next-scale prediction' as opposed to 'next-token prediction'. This innovative transformation enables auto-regressive (AR) transformers to rapidly learn visual distributions and achieve robust…

Computer Vision and Pattern Recognition · Computer Science 2024-08-05 Qian Zhang, Xiangzi Dai, Ninghua Yang, Xiang An, Ziyong Feng, Xingyu Ren

GLaVE-Cap: Global-Local Aligned Video Captioning with Vision Expert Integration

Video detailed captioning aims to generate comprehensive video descriptions to facilitate video understanding. Recently, most efforts in the video detailed captioning community have been made towards a local-to-global paradigm, which first…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Wan Xu, Feng Zhu, Yihan Zeng, Yuanfan Guo, Ming Liu, Hang Xu, Wangmeng Zuo

TALC: Time-Aligned Captions for Multi-Scene Text-to-Video Generation

Most of these text-to-video (T2V) generative models often produce single-scene video clips that depict an entity performing a particular action (e.g., 'a red panda climbing a tree'). However, it is pertinent to generate multi-scene videos…

Computer Vision and Pattern Recognition · Computer Science 2024-11-11 Hritik Bansal, Yonatan Bitton, Michal Yarom, Idan Szpektor, Aditya Grover, Kai-Wei Chang

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

Recent advances in image captioning task have led to increasing interests in video captioning task. However, most works on video captioning are focused on generating single input of aggregated features, which hardly deviates from image…

Computer Vision and Pattern Recognition · Computer Science 2016-05-19 Andrew Shin, Katsunori Ohnishi, Tatsuya Harada

CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter

For video captioning, "pre-training and fine-tuning" has become a de facto paradigm, where ImageNet Pre-training (INP) is usually used to encode the video content, then a task-oriented network is fine-tuned from scratch to cope with caption…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Bang Yang, Tong Zhang, Yuexian Zou