Related papers: Visual Spatial Description: Controlled Spatial-Ori…

Generating Visual Spatial Description via Holistic 3D Scene Understanding

Visual spatial description (VSD) aims to generate texts that describe the spatial relations of the given objects within images. Existing VSD work merely models the 2D geometrical vision features, thus inevitably falling prey to the problem…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-26 · Yu Zhao, Hao Fei, Wei Ji, Jianguo Wei, Meishan Zhang, Min Zhang, Tat-Seng Chua

LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description

Visual Spatial Description (VSD) aims to generate texts that describe the spatial relationships between objects within images. Traditional visual spatial relationship classification (VSRC) methods typically output the spatial relationship…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-31 · Yizhang Jin, Jian Li, Jiangning Zhang, Jianlong Hu, Zhenye Gan, Xin Tan, Yong Liu, Yabiao Wang, Chengjie Wang, Lizhuang Ma

Visual Semantic Description Generation with MLLMs for Image-Text Matching

Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations, continuous, high-dimensional image features vs. discrete, structured text. We…

Multimedia · Computer Science · 2025-07-14 · Junyu Chen, Yihua Gao, Mingyong Li

VICTR: Visual Information Captured Text Representation for Text-to-Image Multimodal Tasks

Text-to-image multimodal tasks, generating/retrieving an image from a given text description, are extremely challenging tasks since raw text descriptions cover quite limited information in order to fully describe visually realistic images.…

Computer Vision and Pattern Recognition · Computer Science · 2020-10-27 · Soyeon Caren Han, Siqu Long, Siwen Luo, Kunze Wang, Josiah Poon

Unleashing Text-to-Image Diffusion Models for Visual Perception

Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-06 · Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, Jiwen Lu

Expressing Visual Relationships via Language

Describing images with text is a fundamental problem in vision-language research. Current studies in this domain mostly focus on single image captioning. However, in various real applications (e.g., image editing, difference interpretation,…

Computation and Language · Computer Science · 2019-06-20 · Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal

Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond

Visual similarities discovery (VSD) is an important task with broad e-commerce applications. Given an image of a certain object, the goal of VSD is to retrieve images of different objects with high perceptual visual similarity. Although…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-29 · Oren Barkan, Tal Reiss, Jonathan Weill, Ori Katz, Roy Hirsch, Itzik Malkiel, Noam Koenigstein

Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

Recently, automatic image caption generation has been an important focus of the work on multimodal translation task. Existing approaches can be roughly categorized into two classes, i.e., top-down and bottom-up, the former transfers the…

Computer Vision and Pattern Recognition · Computer Science · 2019-09-06 · Wei Wei, Ling Cheng, Xianling Mao, Guangyou Zhou, Feida Zhu

Benchmarking Spatial Relationships in Text-to-Image Generation

Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-30 · Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, Yezhou Yang

DualVD: An Adaptive Dual Encoding Model for Deep Visual Understanding in Visual Dialogue

Different from Visual Question Answering task that requires to answer only one question about an image, Visual Dialogue involves multiple questions which cover a broad range of visual content that could be related to any objects,…

Computer Vision and Pattern Recognition · Computer Science · 2019-11-19 · Xiaoze Jiang, Jing Yu, Zengchang Qin, Yingying Zhuang, Xingxing Zhang, Yue Hu, Qi Wu

ViewSRD: 3D Visual Grounding via Structured Multi-View Decomposition

3D visual grounding aims to identify and localize objects in a 3D space based on textual descriptions. However, existing methods struggle with disentangling targets from anchors in complex multi-anchor queries and resolving inconsistencies…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-29 · Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, Shengfeng He

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-03 · Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei

Text Descriptions are Compressive and Invariant Representations for Visual Learning

Modern image classification is based upon directly predicting classes via large discriminative networks, which do not directly contain information about the intuitive visual features that may constitute a classification decision. Recently,…

Computer Vision and Pattern Recognition · Computer Science · 2023-10-31 · Zhili Feng, Anna Bair, J. Zico Kolter

Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions

Image captioning models tend to describe images in an object-centric way, emphasising visible objects. But image descriptions can also abstract away from objects and describe the type of scene depicted. In this paper, we explore the…

Computation and Language · Computer Science · 2022-11-11 · Michele Cafagna, Kees van Deemter, Albert Gatt

Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-24 · Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li

Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering

Text-based Visual Question Answering (TextVQA) aims to produce correct answers for given questions about the images with multiple scene texts. In most cases, the texts naturally attach to the surface of the objects. Therefore, spatial…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-16 · Hao Li, Jinfa Huang, Peng Jin, Guoli Song, Qi Wu, Jie Chen

VSCD: Video-based Scene Change Detection in Unaligned Scenes

Detecting what has changed in an environment is essential for long-term autonomy, yet most change detection settings assume fixed viewpoints, mild misalignment, or only a few changed objects. We introduce Video-based Scene Change Detection…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-21 · Jiae Yoon, Ue-Hwan Kim

Relational Graph Learning for Grounded Video Description Generation

Grounded video description (GVD) encourages captioning models to attend to appropriate video regions (e.g., objects) dynamically and generate a description. Such a setting can help explain the decisions of captioning models and prevents the…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-03 · Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haocheng Shi, Jun Xiao, Yueting Zhuang, William Yang Wang

Evaluating Multimodal Representations on Visual Semantic Textual Similarity

The combination of visual and textual representations has produced excellent results in tasks such as image captioning and visual question answering, but the inference capabilities of multimodal representations are largely untested. In the…

Computation and Language · Computer Science · 2020-04-07 · Oier Lopez de Lacalle, Ander Salaberria, Aitor Soroa, Gorka Azkune, Eneko Agirre

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Visual commonsense reasoning (VCR) is a challenging multi-modal task, which requires high-level cognition and commonsense reasoning ability about the real world. In recent years, large-scale pre-training approaches have been developed and…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-10 · Cheng Yang, Rui Xu, Ye Guo, Peixiang Huang, Yiru Chen, Wenkui Ding, Zhongyuan Wang, Hong Zhou