Related papers: Progress-Aware Video Frame Captioning

Accurate and Fast Compressed Video Captioning

Existing video captioning approaches typically require to first sample video frames from a decoded video and then conduct a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame…

Computer Vision and Pattern Recognition · Computer Science 2024-01-04 Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang

An Integrated Approach for Video Captioning and Applications

Physical computing infrastructure, data gathering, and algorithms have recently had significant advances to extract information from images and videos. The growth has been especially outstanding in image captioning and video captioning.…

Computer Vision and Pattern Recognition · Computer Science 2022-01-25 Soheyla Amirian, Thiab R. Taha, Khaled Rasheed, Hamid R. Arabnia

Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions

Video captioning aims to convey dynamic scenes from videos using natural language, facilitating the understanding of spatiotemporal information within our environment. Although there have been recent advances, generating detailed and…

Computer Vision and Pattern Recognition · Computer Science 2023-05-25 Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, Mohamed Elhoseiny

Move Forward and Tell: A Progressive Generator of Video Descriptions

We present an efficient framework that can generate a coherent paragraph to describe a given video. Previous works on video captioning usually focus on video clips. They typically treat an entire video as a whole and generate the caption…

Computer Vision and Pattern Recognition · Computer Science 2018-07-27 Yilei Xiong, Bo Dai, Dahua Lin

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

Recent advances in image captioning task have led to increasing interests in video captioning task. However, most works on video captioning are focused on generating single input of aggregated features, which hardly deviates from image…

Computer Vision and Pattern Recognition · Computer Science 2016-05-19 Andrew Shin, Katsunori Ohnishi, Tatsuya Harada

Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers

Video captioning is an essential technology to understand scenes and describe events in natural language. To apply it to real-time monitoring, a system needs not only to describe events accurately but also to produce the captions as soon as…

Computer Vision and Pattern Recognition · Computer Science 2021-08-05 Chiori Hori, Takaaki Hori, Jonathan Le Roux

SnapCap: Efficient Snapshot Compressive Video Captioning

Video Captioning (VC) is a challenging multi-modal task since it requires describing the scene in language by understanding various and complex videos. For machines, the traditional VC follows the…

Computer Vision and Pattern Recognition · Computer Science 2024-01-11 Jianqiao Sun, Yudi Su, Hao Zhang, Ziheng Cheng, Zequn Zeng, Zhengjue Wang, Bo Chen, Xin Yuan

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

Generating automatic dense captions for videos that accurately describe their contents remains a challenging area of research. Most current models require processing the entire video at once. Instead, we propose an efficient, online…

Computer Vision and Pattern Recognition · Computer Science 2024-11-25 AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova

Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

Non-Autoregressive Coarse-to-Fine Video Captioning

It is encouraged to see that progress has been made to bridge videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

Video Captioning: a comparative review of where we are and which could be the route

Video captioning is the process of describing the content of a sequence of images capturing its semantic relationships and meanings. Dealing with this task with a single image is arduous, not to mention how difficult it is for a video (or…

Computer Vision and Pattern Recognition · Computer Science 2022-04-14 Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez

The Use of Video Captioning for Fostering Physical Activity

Video Captioning is considered to be one of the most challenging problems in the field of computer vision. Video Captioning involves the combination of different deep learning models to perform object detection, action detection, and…

Computer Vision and Pattern Recognition · Computer Science 2021-04-08 Soheyla Amirian, Abolfazl Farahani, Hamid R. Arabnia, Khaled Rasheed, Thiab R. Taha

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps…

Computer Vision and Pattern Recognition · Computer Science 2021-10-14 Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

Spatio-Temporal Attention Models for Grounded Video Captioning

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description…

Computer Vision and Pattern Recognition · Computer Science 2016-10-19 Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu

Less Is More: Picking Informative Frames for Video Captioning

In video captioning task, the best practice has been achieved by attention-based models which associate salient visual components with sentences in the video. However, existing study follows a common procedure which includes a frame-level…

Computer Vision and Pattern Recognition · Computer Science 2018-03-06 Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang

Discriminative Latent Semantic Graph for Video Captioning

Video captioning aims to automatically generate natural language sentences that can describe the visual contents of a given video. Existing generative models like encoder-decoder frameworks cannot explicitly explore the object-level…

Computer Vision and Pattern Recognition · Computer Science 2021-08-11 Yang Bai, Junyan Wang, Yang Long, Bingzhang Hu, Yang Song, Maurice Pagnucco, Yu Guan

Streamlined Dense Video Captioning

Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han

From Captions to Keyframes: KeyScore for Multimodal Frame Scoring and Video-Language Understanding

Selecting informative keyframes is critical for efficient video understanding, yet existing approaches often rely on heuristics, ignore semantics, or produce redundant frames. We propose KeyScore, a caption-aware frame scoring method that…

Computer Vision and Pattern Recognition · Computer Science 2025-10-13 Shih-Yao Lin, Sibendu Paul, Caren Chen

Learning a Condensed Frame for Memory-Efficient Video Class-Incremental Learning

Recent incremental learning for action recognition usually stores representative videos to mitigate catastrophic forgetting. However, only a few bulky videos can be stored due to the limited memory. To address this problem, we propose…

Computer Vision and Pattern Recognition · Computer Science 2022-11-03 Yixuan Pei, Zhiwu Qing, Jun Cen, Xiang Wang, Shiwei Zhang, Yaxiong Wang, Mingqian Tang, Nong Sang, Xueming Qian

Implicit and Explicit Commonsense for Multi-sentence Video Captioning

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack…

Computer Vision and Pattern Recognition · Computer Science 2024-01-10 Shih-Han Chou, James J. Little, Leonid Sigal