Related papers: Visual Text Correction

Word2VisualVec: Image and Video to Sentence Matching by Visual Feature Prediction

This paper strives to find the sentence best describing the content of an image or video. Different from existing works, which rely on a joint subspace for image / video to sentence matching, we propose to do so in a visual space only. We…

Computer Vision and Pattern Recognition · Computer Science 2016-11-28 Jianfeng Dong, Xirong Li, Cees G. M. Snoek

Deep Learning for Video-Text Retrieval: a Review

Video-Text Retrieval (VTR) aims to search for the most relevant video related to the semantics in a given sentence, and vice versa. In general, this retrieval task is composed of four successive steps: video and textual feature…

Computer Vision and Pattern Recognition · Computer Science 2023-02-27 Cunjuan Zhu, Qi Jia, Wei Chen, Yanming Guo, Yu Liu

Let Me Finish My Sentence: Video Temporal Grounding with Holistic Text Understanding

Video Temporal Grounding (VTG) aims to identify visual frames in a video clip that match text queries. Recent studies in VTG employ cross-attention to correlate visual frames and text queries as individual token sequences. However, these…

Computer Vision and Pattern Recognition · Computer Science 2024-10-18 Jongbhin Woo, Hyeonggon Ryu, Youngjoon Jang, Jae Won Cho, Joon Son Chung

A Comprehensive Review of the Video-to-Text Problem

Research in the Vision and Language area encompasses challenging topics that seek to connect visual and textual information. When the visual information is related to videos, this takes us into Video-Text Research, which includes several…

Computer Vision and Pattern Recognition · Computer Science 2021-12-02 Jesus Perez-Martin, Benjamin Bustos, Silvio Jamil F. Guimarães, Ivan Sipiran, Jorge Pérez, Grethel Coello Said

Predicting Visual Features from Text for Image and Video Caption Retrieval

This paper strives to find amidst a set of sentences the one best describing the content of a given image or video. Different from existing works, which rely on a joint subspace for their image and video caption retrieval, we propose to do…

Computer Vision and Pattern Recognition · Computer Science 2018-07-17 Jianfeng Dong, Xirong Li, Cees G. M. Snoek

Video and Text Matching with Conditioned Embeddings

We present a method for matching a text sentence from a given corpus to a given video clip and vice versa. Traditionally video and text matching is done by learning a shared embedding space and the encoding of one modality is independent of…

Computer Vision and Pattern Recognition · Computer Science 2021-10-22 Ameen Ali, Idan Schwartz, Tamir Hazan, Lior Wolf

Video Abnormal Event Detection by Learning to Complete Visual Cloze Tests

Although deep neural networks (DNNs) enable great progress in video abnormal event detection (VAD), existing solutions typically suffer from two issues: (1) The localization of video events cannot be both precise and comprehensive. (2) The…

Computer Vision and Pattern Recognition · Computer Science 2021-09-20 Siqi Wang, Guang Yu, Zhiping Cai, Xinwang Liu, En Zhu, Jianping Yin

Syntax Customized Video Captioning by Imitating Exemplar Sentences

Enhancing the diversity of sentences to describe video contents is an important problem arising in recent video captioning research. In this paper, we explore this problem from a novel perspective of customizing video captions by imitating…

Computer Vision and Pattern Recognition · Computer Science 2021-12-03 Yitian Yuan, Lin Ma, Wenwu Zhu

CLIP4Caption: CLIP for Video Caption

Video captioning is a challenging task since it requires generating sentences describing various diverse and complex videos. Existing video captioning models lack adequate visual representation due to the neglect of the existence of gaps…

Computer Vision and Pattern Recognition · Computer Science 2021-10-14 Mingkang Tang, Zhanyu Wang, Zhenhua Liu, Fengyun Rao, Dian Li, Xiu Li

Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval

Cross-modal retrieval between videos and texts has gained increasing research interest due to the rapid emergence of videos on the web. Generally, a video contains rich instance and event information and the query text only describes a part…

Computer Vision and Pattern Recognition · Computer Science 2022-09-28 Chengzhi Lin, Ancong Wu, Junwei Liang, Jun Zhang, Wenhang Ge, Wei-Shi Zheng, Chunhua Shen

Dense Video Captioning: A Survey of Techniques, Datasets and Evaluation Protocols

Untrimmed videos have interrelated events, dependencies, context, overlapping events, object-object interactions, domain specificity, and other semantics that are worth highlighting while describing a video in natural language. Owing to…

Computer Vision and Pattern Recognition · Computer Science 2023-11-07 Iqra Qasim, Alexander Horsch, Dilip K. Prasad

Learning the Visualness of Text Using Large Vision-Language Models

Visual text evokes an image in a person's mind, while non-visual text fails to do so. A method to automatically detect visualness in text will enable text-to-image retrieval and generation models to augment text with relevant images. This…

Computation and Language · Computer Science 2023-10-24 Gaurav Verma, Ryan A. Rossi, Christopher Tensmeyer, Jiuxiang Gu, Ani Nenkova

VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning

Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling. Following the human perception process, where the scene is effectively understood…

Computer Vision and Pattern Recognition · Computer Science 2023-02-17 Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le

Queries Are Not Alone: Clustering Text Embeddings for Video Search

The rapid proliferation of video content across various platforms has highlighted the urgent need for advanced video retrieval systems. Traditional methods, which primarily depend on directly matching textual queries with video metadata,…

Information Retrieval · Computer Science 2025-10-10 Peyang Liu, Xi Wang, Ziqiang Cui, Wei Ye

VidText: Towards Comprehensive Evaluation for Video Text Understanding

Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Zhoufaran Yang, Yan Shu, Jing Wang, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe

Weakly-Supervised Alignment of Video With Text

Suppose that we are given a set of videos, along with natural language descriptions in the form of multiple sentences (e.g., manual annotations, movie scripts, sport summaries etc.), and that these sentences appear in the same temporal…

Computer Vision and Pattern Recognition · Computer Science 2015-12-22 Piotr Bojanowski, Rémi Lajugie, Edouard Grave, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid

Visual Semantic Reasoning for Image-Text Matching

Image-text matching has been a hot research topic bridging the vision and language areas. It remains challenging because the current representation of image usually lacks global semantic concepts as in its corresponding text caption. To…

Computer Vision and Pattern Recognition · Computer Science 2019-09-09 Kunpeng Li, Yulun Zhang, Kai Li, Yuanyuan Li, Yun Fu

Mining for meaning: from vision to language through multiple networks consensus

Describing visual data into natural language is a very challenging task, at the intersection of computer vision, natural language processing and machine learning. Language goes well beyond the description of physical objects and their…

Computer Vision and Pattern Recognition · Computer Science 2020-05-26 Iulia Duta, Andrei Liviu Nicolicioiu, Simion-Vlad Bogolin, Marius Leordeanu

Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos

Sequential video understanding, an emerging video understanding task, has attracted considerable research attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding where the accurate…

Computer Vision and Pattern Recognition · Computer Science 2023-03-29 Sixun Dong, Huazhang Hu, Dongze Lian, Weixin Luo, Yicheng Qian, Shenghua Gao

Learning Convolutional Text Representations for Visual Question Answering

Visual question answering is a recently proposed artificial intelligence task that requires a deep understanding of both images and texts. In deep learning, images are typically modeled through convolutional neural networks, and texts are…

Machine Learning · Computer Science 2018-09-05 Zhengyang Wang, Shuiwang Ji