Related papers: Multi-Modal interpretable automatic video captioni…

An Attempt towards Interpretable Audio-Visual Video Captioning

Automatically generating a natural language sentence to describe the content of an input video is a very challenging problem. It is an essential multimodal task in which auditory and visual contents are equally important. Although audio…

Computer Vision and Pattern Recognition · Computer Science 2018-12-10 Yapeng Tian, Chenxiao Guan, Justin Goodman, Marc Moore, Chenliang Xu

Multi-modal Transformer for Video Retrieval

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit…

Computer Vision and Pattern Recognition · Computer Science 2020-07-22 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

Multi-modal Dense Video Captioning

Dense video captioning is a task of localizing interesting events from an untrimmed video and producing textual description (captions) for each localized event. Most of the previous works in dense video captioning are solely based on visual…

Computer Vision and Pattern Recognition · Computer Science 2020-05-07 Vladimir Iashin, Esa Rahtu

Video Captioning with Multi-Faceted Attention

Recently, video captioning has been attracting an increasing amount of interest, due to its potential for improving accessibility and information retrieval. While existing methods rely on different kinds of visual features and model…

Computer Vision and Pattern Recognition · Computer Science 2016-12-02 Xiang Long, Chuang Gan, Gerard de Melo

VATEX Captioning Challenge 2019: Multi-modal Information Fusion and Multi-stage Training Strategy for Video Captioning

Multi-modal information is essential to describe what has happened in a video. In this work, we represent videos by various appearance, motion and audio information guided with video topic. By following multi-stage training strategy, our…

Computation and Language · Computer Science 2019-10-15 Ziqi Zhang, Yaya Shi, Jiutong Wei, Chunfeng Yuan, Bing Li, Weiming Hu

Cross-Modal Graph with Meta Concepts for Video Captioning

Video captioning targets interpreting the complex visual contents as text descriptions, which requires the model to fully understand video scenes including objects and their interactions. Prevailing methods adopt off-the-shelf object…

Computer Vision and Pattern Recognition · Computer Science 2022-09-07 Hao Wang, Guosheng Lin, Steven C. H. Hoi, Chunyan Miao

Multimodal Memory Modelling for Video Captioning

Video captioning which automatically translates video clips into natural language sentences is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and…

Computer Vision and Pattern Recognition · Computer Science 2016-11-18 Junbo Wang, Wei Wang, Yan Huang, Liang Wang, Tieniu Tan

Diverse Video Captioning by Adaptive Spatio-temporal Attention

To generate proper captions for videos, the inference needs to identify relevant concepts and pay attention to the spatial relationships between them as well as to the temporal development in the clip. Our end-to-end encoder-decoder video…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Zohreh Ghaderi, Leonard Salewski, Hendrik P. A. Lensch

Delving Deeper into the Decoder for Video Captioning

Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence. The encoder-decoder framework is the most popular paradigm for this task in recent years. However, there exist some…

Computer Vision and Pattern Recognition · Computer Science 2021-02-15 Haoran Chen, Jianmin Li, Xiaolin Hu

Video Captioning: a comparative review of where we are and which could be the route

Video captioning is the process of describing the content of a sequence of images capturing its semantic relationships and meanings. Dealing with this task with a single image is arduous, not to mention how difficult it is for a video (or…

Computer Vision and Pattern Recognition · Computer Science 2022-04-14 Daniela Moctezuma, Tania Ramírez-delReal, Guillermo Ruiz, Othón González-Chávez

Video Captioning via Hierarchical Reinforcement Learning

Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short…

Computer Vision and Pattern Recognition · Computer Science 2018-03-30 Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, William Yang Wang

Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

A major challenge for video captioning is to combine audio and visual cues. Existing multi-modal fusion methods have shown encouraging results in video understanding. However, the temporal structures of multiple modalities at different…

Computation and Language · Computer Science 2018-04-17 Xin Wang, Yuan-Fang Wang, William Yang Wang

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

Recent advances in the image captioning task have led to increasing interest in the video captioning task. However, most works on video captioning are focused on generating single input of aggregated features, which hardly deviates from image…

Computer Vision and Pattern Recognition · Computer Science 2016-05-19 Andrew Shin, Katsunori Ohnishi, Tatsuya Harada

Multimodal Transformer with Multi-View Visual Representation for Image Captioning

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolution neural network (CNN)-based…

Computer Vision and Pattern Recognition · Computer Science 2019-05-21 Jun Yu, Jing Li, Zhou Yu, Qingming Huang

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-30 Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

Automated Audio Captioning: An Overview of Recent Progress and New Challenges

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent…

Audio and Speech Processing · Electrical Eng. & Systems 2022-09-28 Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang

Contrastive Graph Multimodal Model for Text Classification in Videos

The extraction of text information in videos serves as a critical step towards semantic understanding of videos. It usually involves two steps: (1) text recognition and (2) text classification. To localize texts in videos, we can resort…

Computer Vision and Pattern Recognition · Computer Science 2022-06-07 Ye Liu, Changchong Lu, Chen Lin, Di Yin, Bo Ren

Spatio-Temporal Attention Models for Grounded Video Captioning

Automatic video captioning is challenging due to the complex interactions in dynamic real scenes. A comprehensive system would ultimately localize and track the objects, actions and interactions present in a video and generate a description…

Computer Vision and Pattern Recognition · Computer Science 2016-10-19 Mihai Zanfir, Elisabeta Marinoiu, Cristian Sminchisescu

VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling. Following the human perception process, where the scene is effectively understood…

Computer Vision and Pattern Recognition · Computer Science 2023-02-17 Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le

Finding It at Another Side: A Viewpoint-Adapted Matching Encoder for Change Captioning

Change Captioning is a task that aims to describe the difference between images with natural language. Most existing methods treat this problem as a difference judgment without the existence of distractors, such as viewpoint changes.…

Computer Vision and Pattern Recognition · Computer Science 2020-10-01 Xiangxi Shi, Xu Yang, Jiuxiang Gu, Shafiq Joty, Jianfei Cai