Related papers: Improving Visual-Semantic Embedding with Adaptive …

Learning the Best Pooling Strategy for Visual Semantic Embedding

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models…

Computer Vision and Pattern Recognition · Computer Science 2021-07-07 Jiacheng Chen, Hexiang Hu, Hao Wu, Yuning Jiang, Changhu Wang

VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling

Joint visual-semantic embeddings (VSE) have become a research hotspot for the task of image annotation, which suffers from the issue of semantic gap, i.e., the gap between images' visual features (low-level) and labels' semantic features…

Computer Vision and Pattern Recognition · Computer Science 2018-08-14 Guibing Guo, Songlin Zhai, Fajie Yuan, Yuan Liu, Xingwei Wang

Improving Visual-Semantic Embeddings by Learning Semantically-Enhanced Hard Negatives for Cross-modal Information Retrieval

Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained by adopting a hard…

Computer Vision and Pattern Recognition · Computer Science 2023-02-15 Yan Gong, Georgina Cosma

Asymmetric Visual Semantic Embedding Framework for Efficient Vision-Language Alignment

Learning visual semantic similarity is a critical challenge in bridging the gap between images and texts. However, there exist inherent variations between vision and language data, such as information density, i.e., images can contain…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Yang Liu, Mengyuan Liu, Shudong Huang, Jiancheng Lv

VSE++: Improving Visual-Semantic Embeddings with Hard Negatives

We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to…

Machine Learning · Computer Science 2018-07-31 Fartash Faghri, David J. Fleet, Jamie Ryan Kiros, Sanja Fidler

Dynamic Visual Semantic Sub-Embeddings and Fast Re-Ranking

The core of cross-modal matching is to accurately measure the similarity between different modalities in a unified representation space. However, compared to textual descriptions of a certain perspective, the visual modality has more…

Computer Vision and Pattern Recognition · Computer Science 2023-12-22 Wenzhang Wei, Zhipeng Gui, Changguang Wu, Anqi Zhao, Dehua Peng, Huayi Wu

Dissecting Deep Metric Learning Losses for Image-Text Retrieval

Visual-Semantic Embedding (VSE) is a prevalent approach in image-text retrieval by learning a joint embedding space between the image and language modalities where semantic similarities would be preserved. The triplet loss with…

Computer Vision and Pattern Recognition · Computer Science 2022-10-25 Hong Xuan, Xi Chen

Multiple Visual-Semantic Embedding for Video Retrieval from Query Sentence

Visual-semantic embedding aims to learn a joint embedding space where related video and sentence instances are located close to each other. Most existing methods put instances in a single embedding space. However, they struggle to embed…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Huy Manh Nguyen, Tomo Miyazaki, Yoshihiro Sugaya, Shinichiro Omachi

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching

Enabling Visual Semantic Models to effectively handle multi-view description matching has been a longstanding challenge. Existing methods typically learn a set of embeddings to find the optimal match for each view's text and compute…

Computer Vision and Pattern Recognition · Computer Science 2025-07-18 Yang Liu, Wentao Feng, Zhuoyao Liu, Shudong Huang, Jiancheng Lv

Language Models as Zero-shot Visual Semantic Learners

Visual Semantic Embedding (VSE) models, which map images into a rich semantic embedding space, have been a milestone in object recognition and zero-shot learning. Current approaches to VSE heavily rely on static word embedding techniques.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-27 Yue Jiao, Jonathon Hare, Adam Prügel-Bennett

Learning Visually-Grounded Semantics from Contrastive Adversarial Samples

We study the problem of grounding distributional representations of texts on the visual domain, namely visual-semantic embeddings (VSE for short). Beginning with an insightful adversarial attack on VSE embeddings, we show the limitation of…

Computation and Language · Computer Science 2018-06-28 Haoyue Shi, Jiayuan Mao, Tete Xiao, Yuning Jiang, Jian Sun

Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Weakly-Supervised Semantic Segmentation (WSSS) methods with image-level labels generally train a classification network to generate the Class Activation Maps (CAMs) as the initial coarse segmentation labels. However, current WSSS methods…

Computer Vision and Pattern Recognition · Computer Science 2022-02-11 Lixiang Ru, Bo Du, Yibing Zhan, Chen Wu

PADS: Policy-Adapted Sampling for Visual Similarity Learning

Learning visual similarity requires learning relations, typically between triplets of images. Although triplet approaches are powerful, their computational complexity mostly limits training to only a subset of all possible training…

Computer Vision and Pattern Recognition · Computer Science 2020-03-31 Karsten Roth, Timo Milbich, Björn Ommer

UniVSE: Robust Visual Semantic Embeddings via Structured Semantic Representations

We propose Unified Visual-Semantic Embeddings (UniVSE) for learning a joint space of visual and textual concepts. The space unifies the concepts at different levels, including objects, attributes, relations, and full scenes. A contrastive…

Computer Vision and Pattern Recognition · Computer Science 2019-04-30 Hao Wu, Jiayuan Mao, Yufeng Zhang, Yuning Jiang, Lei Li, Weiwei Sun, Wei-Ying Ma

SSBNet: Improving Visual Recognition Efficiency by Adaptive Sampling

Downsampling is widely adopted to achieve a good trade-off between accuracy and latency for visual recognition. Unfortunately, the commonly used pooling layers are not learned, and thus cannot preserve important information. As another…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Ho Man Kwan, Shenghui Song

Visualizing Deep Similarity Networks

For convolutional neural network models that optimize an image embedding, we propose a method to highlight the regions of images that contribute most to pairwise similarity. This work is a corollary to the visualization tools developed for…

Computer Vision and Pattern Recognition · Computer Science 2019-01-04 Abby Stylianou, Richard Souvenir, Robert Pless

Auto-pooling: Learning to Improve Invariance of Image Features from Image Sequences

Learning invariant representations from images is one of the hardest challenges facing computer vision. Spatial pooling is widely used to create invariance to spatial shifting, but it is restricted to convolutional models. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2013-03-19 Sainbayar Sukhbaatar, Takaki Makino, Kazuyuki Aihara

Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching

With the rapid development of multimodal learning, the image-text matching task, as a bridge connecting vision and language, has become increasingly important. Based on existing research, this study proposes an innovative visual semantic…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Wenjing Chen

CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

Human perception of visual similarity is inherently adaptive and subjective, depending on the users' interests and focus. However, most image retrieval systems fail to reflect this flexibility, relying on a fixed, monolithic metric that…

Computer Vision and Pattern Recognition · Computer Science 2026-04-14 Sohwi Lim, Lee Hyoseok, Jungjoon Park, Tae-Hyun Oh

Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval

Visual-semantic embedding aims to find a shared latent space where related visual and textual instances are close to each other. Most current methods learn injective embedding functions that map an instance to a single point in the shared…

Computer Vision and Pattern Recognition · Computer Science 2019-07-18 Yale Song, Mohammad Soleymani