Related papers: Dual Encoding for Zero-Example Video Retrieval

Dual Encoding for Video Retrieval by Text

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no…

Computer Vision and Pattern Recognition · Computer Science 2021-02-19 Jianfeng Dong, Xirong Li, Chaoxi Xu, Xun Yang, Gang Yang, Xun Wang, Meng Wang

Reasoning Text-to-Video Retrieval via Digital Twin Video Representations and Large Language Models

The goal of text-to-video retrieval is to search large databases for relevant videos based on text queries. Existing methods have progressed to handling explicit queries where the visual content of interest is described explicitly; however,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Yiqing Shen, Chenxiao Fan, Chenjia Li, Mathias Unberath

Interpretable Embedding for Ad-hoc Video Search

Answering query with semantic concepts has long been the mainstream approach for video search. Until recently, its performance is surpassed by concept-free approach, which embeds queries in a joint space as videos. Nevertheless, the…

Computer Vision and Pattern Recognition · Computer Science 2024-02-20 Jiaxin Wu, Chong-Wah Ngo

Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly favor the concept-based paradigm on retrieval with simple queries, which are usually…

Computer Vision and Pattern Recognition · Computer Science 2020-07-07 Xun Yang, Jianfeng Dong, Yixin Cao, Xun Wang, Meng Wang, Tat-Seng Chua

Learning text-to-video retrieval from image captioning

We describe a protocol to study text-to-video retrieval training with unlabeled videos, where we assume (i) no access to labels for any videos, i.e., no access to the set of ground-truth captions, but (ii) access to labeled images in the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-29 Lucas Ventura, Cordelia Schmid, Gül Varol

A Straightforward Framework For Video Retrieval Using CLIP

Video Retrieval is a challenging task where a text query is matched to a video or vice versa. Most of the existing approaches for addressing such a problem rely on annotations made by the users. Although simple, this approach is not always…

Computer Vision and Pattern Recognition · Computer Science 2021-03-01 Jesús Andrés Portillo-Quintero, José Carlos Ortiz-Bayliss, Hugo Terashima-Marín

Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions

Contrastively-trained Vision-Language Models (VLMs), such as CLIP, have become the standard approach for learning discriminative vision-language representations. However, these models often exhibit shallow language understanding,…

Computer Vision and Pattern Recognition · Computer Science 2025-09-24 Ioanna Ntinou, Alexandros Xenos, Yassine Ouali, Adrian Bulat, Georgios Tzimiropoulos

Video retrieval based on deep convolutional neural network

Recently, with the enormous growth of online videos, fast video retrieval research has received increasing attention. As an extension of image hashing techniques, traditional video hashing methods mainly depend on hand-crafted features and…

Computer Vision and Pattern Recognition · Computer Science 2017-12-04 Yj Dong, JG Li

Strategies for Searching Video Content with Text Queries or Video Examples

The large number of user-generated videos uploaded on to the Internet everyday has led to many commercial video search engines, which mainly rely on text metadata for search. However, metadata is often lacking for user-generated videos,…

Information Retrieval · Computer Science 2016-06-21 Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann, Chuang Gan, Xingzhong Du, Xiaojun Chang

Survey of Visual-Semantic Embedding Methods for Zero-Shot Image Retrieval

Visual-semantic embedding is an interesting research topic because it is useful for various tasks, such as visual question answering (VQA), image-text retrieval, image captioning, and scene graph generation. In this paper, we focus on…

Computer Vision and Pattern Recognition · Computer Science 2021-09-29 Kazuya Ueki

Multi-modal Transformer for Video Retrieval

The task of retrieving video content relevant to natural language queries plays a critical role in effectively handling internet-scale datasets. Most of the existing methods for this caption-to-video retrieval problem do not fully exploit…

Computer Vision and Pattern Recognition · Computer Science 2020-07-22 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

Use What You Have: Video Retrieval Using Representations From Collaborative Experts

The rapid growth of video on the internet has made searching for video content using natural language queries a significant challenge. Human-generated queries for video datasets 'in the wild' vary a lot in terms of degree of specificity,…

Computer Vision and Pattern Recognition · Computer Science 2020-02-17 Yang Liu, Samuel Albanie, Arsha Nagrani, Andrew Zisserman

SEA: Sentence Encoder Assembly for Video Retrieval by Textual Queries

Retrieving unlabeled videos by textual queries, known as Ad-hoc Video Search (AVS), is a core theme in multimedia data management and retrieval. The success of AVS counts on cross-modal representation learning that encodes both query…

Computer Vision and Pattern Recognition · Computer Science 2020-11-25 Xirong Li, Fangming Zhou, Chaoxi Xu, Jiaqi Ji, Gang Yang

Look Beyond Saliency: Low-Attention Guided Dual Encoding for Video Semantic Search

Video semantic search in densely crowded scenes remains a challenging task due to visual encoders tendency to prioritize salient foreground regions while neglecting contextually important, background areas. We propose an Inverse Attention…

Computer Vision and Pattern Recognition · Computer Science 2026-05-08 Faisal Aljehrai, Mohammed A. Alkhrashi, Alreem Almuhrij, Sarah Abuhimed, Noorh Aldossary, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J Khan

SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models

Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images. We revisit a zero-shot generate-and-encode pipeline: a vision-language model first…

Information Retrieval · Computer Science 2025-09-22 Thong Nguyen, Yibin Lei, Jia-Huei Ju, Andrew Yates

Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Generally, the information from each modality exhibits distinct representations and semantic information, which makes feature tends to be in…

Information Retrieval · Computer Science 2023-08-29 Zichen Yuan, Qi Shen, Bingyi Zheng, Yuting Liu, Linying Jiang, Guibing Guo

Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval

In the recent years, the dual-encoder vision-language models (e.g., CLIP) have achieved remarkable text-to-image retrieval performance. However, we discover that these models usually results in very different retrievals for a pair of…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Jiacheng Cheng, Hijung Valentina Shin, Nuno Vasconcelos, Bryan Russell, Fabian Caba Heilbron

Language-Agnostic Visual Embeddings for Cross-Script Handwriting Retrieval

Handwritten word retrieval is vital for digital archives but remains challenging due to large handwriting variability and cross-lingual semantic gaps. While large vision-language models offer potential solutions, their prohibitive…

Computer Vision and Pattern Recognition · Computer Science 2026-01-19 Fangke Chen, Tianhao Dong, Sirry Chen, Guobin Zhang, Yishu Zhang, Yining Chen

Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval

There has been significant attention to the research on dense video captioning, which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a…

Computer Vision and Pattern Recognition · Computer Science 2024-04-12 Minkuk Kim, Hyeon Bae Kim, Jinyoung Moon, Jinwoo Choi, Seong Tae Kim

Multimodal Lengthy Videos Retrieval Framework and Evaluation Metric

Precise video retrieval requires multi-modal correlations to handle unseen vocabulary and scenes, becoming more complex for lengthy videos where models must perform effectively without prior training on a specific dataset. We introduce a…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Mohamed Eltahir, Osamah Sarraj, Mohammed Bremoo, Mohammed Khurd, Abdulrahman Alfrihidi, Taha Alshatiri, Mohammad Almatrafi, Tanveer Hussain