Related papers: Video Captioning Using Weak Annotation

Weakly Supervised Dense Video Captioning

This paper focuses on a novel and challenging vision task, dense video captioning, which aims to automatically describe a video clip with multiple informative and diverse caption sentences. The proposed method is trained without explicit…

Computer Vision and Pattern Recognition · Computer Science 2017-04-06 Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue

Weakly Supervised Dense Event Captioning in Videos

Dense event captioning aims to detect and describe all events of interest contained in a video. Despite the advanced development in this area, existing methods tackle this task by making use of dense temporal annotations, which is…

Computer Vision and Pattern Recognition · Computer Science 2018-12-11 Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, Junzhou Huang

Streamlined Dense Video Captioning

Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning of individual events. Most existing…

Computer Vision and Pattern Recognition · Computer Science 2019-04-09 Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han

Weakly Supervised Dense Video Captioning via Jointly Usage of Knowledge Distillation and Cross-modal Matching

This paper proposes an approach to Dense Video Captioning (DVC) without pairwise event-sentence annotation. First, we adopt the knowledge distilled from relevant and well solved tasks to generate high-quality event proposals. Then we…

Computer Vision and Pattern Recognition · Computer Science 2021-05-19 Bofeng Wu, Guocheng Niu, Jun Yu, Xinyan Xiao, Jian Zhang, Hua Wu

A Semantics-Assisted Video Captioning Model Trained with Scheduled Sampling

Given the features of a video, recurrent neural networks can be used to automatically generate a caption for the video. Existing methods for video captioning have at least three limitations. First, semantic information has been widely…

Computer Vision and Pattern Recognition · Computer Science 2021-02-15 Haoran Chen, Ke Lin, Alexander Maye, Jianming Li, Xiaolin Hu

Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning

Video captioning generates a sentence that describes the video content. Existing methods always require a number of captions (e.g., 10 or 20) per video to train the model, which is quite costly. In this work, we explore the possibility of…

Computer Vision and Pattern Recognition · Computer Science 2024-11-07 Ping Li, Tao Wang, Xinkui Zhao, Xianghua Xu, Mingli Song

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

Recent advances in image captioning task have led to increasing interests in video captioning task. However, most works on video captioning are focused on generating single input of aggregated features, which hardly deviates from image…

Computer Vision and Pattern Recognition · Computer Science 2016-05-19 Andrew Shin, Katsunori Ohnishi, Tatsuya Harada

Non-Autoregressive Coarse-to-Fine Video Captioning

It is encouraging to see that progress has been made to bridge videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

Video Captioning via Hierarchical Reinforcement Learning

Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short…

Computer Vision and Pattern Recognition · Computer Science 2018-03-30 Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, William Yang Wang

Grounded Video Description

Video description is one of the most challenging problems in vision and language understanding due to the large variability both on the video and language side. Models, hence, typically shortcut the difficulty in recognition and generate…

Computer Vision and Pattern Recognition · Computer Science 2019-05-07 Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach

Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically…

Computer Vision and Pattern Recognition · Computer Science 2020-01-28 Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong

Implicit and Explicit Commonsense for Multi-sentence Video Captioning

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack…

Computer Vision and Pattern Recognition · Computer Science 2024-01-10 Shih-Han Chou, James J. Little, Leonid Sigal

An Attempt towards Interpretable Audio-Visual Video Captioning

Automatically generating a natural language sentence to describe the content of an input video is a very challenging problem. It is an essential multimodal task in which auditory and visual contents are equally important. Although audio…

Computer Vision and Pattern Recognition · Computer Science 2018-12-10 Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps…

Computer Vision and Pattern Recognition · Computer Science 2021-10-14 Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

Improving Image Captioning with Better Use of Captions

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision community. In this paper, we present a novel image captioning architecture to better explore semantics…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhan Shi, Xu Zhou, Xipeng Qiu, Xiaodan Zhu

Image Captioning

This paper discusses and demonstrates the outcomes from our experimentation on Image Captioning. Image captioning is a much more involved task than image recognition or classification, because of the additional challenge of recognizing the…

Computer Vision and Pattern Recognition · Computer Science 2018-05-24 Vikram Mullachery, Vishal Motwani

Attentive Semantic Video Generation using Captions

This paper proposes a network architecture to perform variable length semantic video generation using captions. We adopt a new perspective towards video generation where we allow the captions to be combined with the long-term and short-term…

Computer Vision and Pattern Recognition · Computer Science 2017-11-17 Tanya Marwah, Gaurav Mittal, Vineeth N. Balasubramanian

Deep Learning for Video Classification and Captioning

Accelerated by the tremendous increase in Internet bandwidth and storage space, video data has been generated, published and spread explosively, becoming an indispensable part of today's big data. In this paper, we focus on reviewing two…

Computer Vision and Pattern Recognition · Computer Science 2018-02-23 Zuxuan Wu, Ting Yao, Yanwei Fu, Yu-Gang Jiang

Enriching Video Captions With Contextual Text

Understanding video content and generating caption with context is an important and challenging task. Unlike prior methods that typically attempt to generate generic video captions without context, our architecture contextualizes captioning…

Computer Vision and Pattern Recognition · Computer Science 2020-07-30 Philipp Rimle, Pelin Dogan, Markus Gross

End-to-end Dense Video Captioning as Sequence Generation

Dense video captioning aims to identify the events of interest in an input video, and generate descriptive captions for each event. Previous approaches usually follow a two-stage generative process, which first proposes a segment for each…

Computer Vision and Pattern Recognition · Computer Science 2022-09-19 Wanrong Zhu, Bo Pang, Ashish V. Thapliyal, William Yang Wang, Radu Soricut