Related papers: Multi-modal Video Chapter Generation

VidChapters-7M: Video Chapters at Scale

Segmenting long videos into chapters enables users to quickly navigate to the information of their interest. This important topic has been understudied due to the lack of publicly released datasets. To address this issue, we present…

Computer Vision and Pattern Recognition · Computer Science 2023-09-26 Antoine Yang, Arsha Nagrani, Ivan Laptev, Josef Sivic, Cordelia Schmid

Visual Subtitle Feature Enhanced Video Outline Generation

With the tremendously increasing number of videos, there is a great demand for techniques that help people quickly navigate to the video segments they are interested in. However, current works on video understanding mainly focus on video…

Computer Vision and Pattern Recognition · Computer Science 2022-09-02 Qi Lv, Ziqiang Cao, Wenrui Xie, Derui Wang, Jingwen Wang, Zhiwei Hu, Tangkun Zhang, Ba Yuan, Yuanhang Li, Min Cao, Wenjie Li, Sujian Li, Guohong Fu

Multi-sentence Video Grounding for Long Video Generation

Video generation has witnessed great success recently, but its application to generating long videos remains challenging due to the difficulty in maintaining the temporal consistency of generated videos and the high memory cost…

Computer Vision and Pattern Recognition · Computer Science 2024-07-19 Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Wenwu Zhu

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient…

Computer Vision and Pattern Recognition · Computer Science 2023-06-02 Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong

VTC: Improving Video-Text Retrieval with User Comments

Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are…

Computer Vision and Pattern Recognition · Computer Science 2022-10-21 Laura Hanu, James Thewlis, Yuki M. Asano, Christian Rupprecht

Video Generation Beyond a Single Clip

We tackle the long video generation problem, i.e., generating videos beyond the output length of video generation models. Due to the computation resource constraints, video generation models can only generate video clips that are relatively…

Computer Vision and Pattern Recognition · Computer Science 2023-04-18 Hsin-Ping Huang, Yu-Chuan Su, Ming-Hsuan Yang

Frame-Level Captions for Long Video Generation with Complex Multi Scenes

Generating long videos that can show complex stories, like movie scenes from scripts, has great promise and offers much more than short clips. However, current methods that use autoregression with diffusion models often struggle because…

Computer Vision and Pattern Recognition · Computer Science 2025-05-28 Guangcong Zheng, Jianlong Yuan, Bo Wang, Haoyang Huang, Guoqing Ma, Nan Duan

Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation

Generating videos for visual storytelling can be a tedious and complex process that typically requires either live-action filming or graphics animation rendering. To bypass these challenges, our key idea is to utilize the abundance of…

Computer Vision and Pattern Recognition · Computer Science 2023-07-14 Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen

A Local-to-Global Approach to Multi-modal Movie Scene Segmentation

Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of…

Computer Vision and Pattern Recognition · Computer Science 2020-04-29 Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin

Video Storytelling: Textual Summaries for Events

Bridging vision and natural language is a longstanding goal in computer vision and multimedia research. While earlier works focus on generating a single-sentence description for visual content, recent works have studied paragraph…

Multimedia · Computer Science 2020-05-15 Junnan Li, Yongkang Wong, Qi Zhao, Mohan S. Kankanhalli

UltraGen: High-Resolution Video Generation with Hierarchical Attention

Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based…

Computer Vision and Pattern Recognition · Computer Science 2025-10-22 Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi

MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Nonetheless, numerous limitations exist within existing public MSMO datasets, including insufficient maintenance, data inaccessibility,…

Computer Vision and Pattern Recognition · Computer Science 2023-11-21 Jielin Qiu, Jiacheng Zhu, William Han, Aditesh Kumar, Karthik Mittal, Claire Jin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Ding Zhao, Bo Li, Lijuan Wang

Move Forward and Tell: A Progressive Generator of Video Descriptions

We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Yilei Xiong, Bo Dai, Dahua Lin

GEM-VPC: A dual Graph-Enhanced Multimodal integration for Video Paragraph Captioning

Video Paragraph Captioning (VPC) aims to generate paragraph captions that summarise key events within a video. Despite recent advancements, challenges persist, notably in effectively utilising multimodal signals inherent in videos and…

Computer Vision and Pattern Recognition · Computer Science 2024-10-15 Eileen Wang, Caren Han, Josiah Poon

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

This paper proposes a practical multimodal video summarization task setting and a dataset to train and evaluate the task. The target task involves summarizing a given video into a predefined number of keyframe-caption pairs and displaying…

Computation and Language · Computer Science 2023-12-05 Keito Kudo, Haruki Nagasawa, Jun Suzuki, Nobuyuki Shimizu

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning on Language-Video Foundation Models

Customized text-to-video generation aims to generate text-guided videos with user-given subjects, which has gained increasing attention. However, existing works are primarily limited to single-subject oriented text-to-video generation,…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Yaofei Wu, Wenwu Zhu

Title Generation for User Generated Videos

A great video title describes the most salient event compactly and captures the viewer's attention. In contrast, video captioning tends to generate sentences that describe the video as a whole. Although generating a video title…

Computer Vision and Pattern Recognition · Computer Science 2016-09-09 Kuo-Hao Zeng, Tseng-Hung Chen, Juan Carlos Niebles, Min Sun

USV: Towards Understanding the User-generated Short-form Videos

Several large-scale video datasets have been published in recent years and have advanced the area of video understanding. However, the newly emerged user-generated short-form videos have rarely been studied. This paper presents USV, the…

Computer Vision and Pattern Recognition · Computer Science 2026-05-21 Haoyue Cheng, Su Xu, Liwei Jin, Wayne Wu, Chen Qian, Limin Wang

BulletGen: Improving 4D Reconstruction with Bullet-Time Generation

Transforming casually captured, monocular videos into fully immersive dynamic experiences is a highly ill-posed task, and comes with significant challenges, e.g., reconstructing unseen regions, and dealing with the ambiguity in monocular…

Graphics · Computer Science 2026-04-08 Denis Rozumny, Jonathon Luiten, Numair Khan, Johannes Schönberger, Peter Kontschieder

Video-guided Machine Translation with Global Video Context

Video-guided Multimodal Translation (VMT) has advanced significantly in recent years. However, most existing methods rely on locally aligned video segments paired one-to-one with subtitles, limiting their ability to capture global narrative…

Computer Vision and Pattern Recognition · Computer Science 2026-04-09 Jian Chen, JinZe Lv, Zi Long, XiangHua Fu