Related papers: Visually Grounded Concept Composition

Grounded Semantic Composition for Visual Scenes

We present a visually-grounded language understanding model based on a study of how people verbally describe objects in scenes. The emphasis of the model is on the combination of individual word meanings to produce meanings for complex…

Artificial Intelligence · Computer Science 2011-07-04 P. Gorniak, D. Roy

MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition

Humans have the ability to learn novel compositional concepts by recalling and generalizing primitive concepts acquired from past experiences. Inspired by this observation, in this paper, we propose MetaReVision, a retrieval-enhanced…

Computation and Language · Computer Science 2023-11-06 Guangyue Xu, Parisa Kordjamshidi, Joyce Chai

Variational Cross-Graph Reasoning and Adaptive Structured Semantics Learning for Compositional Temporal Grounding

Temporal grounding is the task of locating a specific segment from an untrimmed video according to a query sentence. This task has achieved significant momentum in the computer vision community as it enables activity grounding beyond…

Computer Vision and Pattern Recognition · Computer Science 2023-05-16 Juncheng Li, Siliang Tang, Linchao Zhu, Wenqiao Zhang, Yi Yang, Tat-Seng Chua, Fei Wu, Yueting Zhuang

Neural Algebra of Classifiers

The world is fundamentally compositional, so it is natural to think of visual recognition as the recognition of basic visual primitives that are composed according to well-defined rules. This strategy allows us to recognize unseen complex…

Computer Vision and Pattern Recognition · Computer Science 2018-01-29 Rodrigo Santa Cruz, Basura Fernando, Anoop Cherian, Stephen Gould

Learning Cross-modal Context Graph for Visual Grounding

Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic…

Computer Vision and Pattern Recognition · Computer Science 2019-11-26 Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

Learning Compositional Visual Concepts with Mutual Consistency

Compositionality of semantic concepts in image synthesis and analysis is appealing as it can help in decomposing known and generatively recomposing unknown data. For instance, we may learn concepts of changing illumination, geometry or…

Computer Vision and Pattern Recognition · Computer Science 2018-03-29 Yunye Gong, Srikrishna Karanam, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, Peter C. Doerschuk

Visually Grounded Compound PCFGs

Exploiting visual groundings for language understanding has recently been drawing much attention. In this work, we study visually grounded grammar induction and learn a constituency parser from both unlabeled text and its visual groundings.…

Computation and Language · Computer Science 2020-12-08 Yanpeng Zhao, Ivan Titov

Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning

Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence. Thanks to the semantic diversity of natural language descriptions, temporal grounding allows activity grounding…

Computer Vision and Pattern Recognition · Computer Science 2022-03-29 Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, Xin Eric Wang

Grounded learning for compositional vector semantics

Categorical compositional distributional semantics is an approach to modelling language that combines the success of vector-based models of meaning with the compositional power of formal semantics. However, this approach was developed…

Computation and Language · Computer Science 2024-01-17 Martha Lewis

COVR: A test-bed for Visually Grounded Compositional Generalization with real images

While interest in models that generalize at test time to new compositions has risen in recent years, benchmarks in the visually-grounded domain have thus far been restricted to synthetic images. In this work, we propose COVR, a new test-bed…

Computation and Language · Computer Science 2021-09-23 Ben Bogin, Shivanshu Gupta, Matt Gardner, Jonathan Berant

3D Concept Grounding on Neural Fields

In this paper, we address the challenging problem of 3D concept grounding (i.e. segmenting and learning visual concepts) by looking at RGBD images and reasoning about paired questions and answers. Existing visual reasoning approaches…

Computer Vision and Pattern Recognition · Computer Science 2022-07-14 Yining Hong, Yilun Du, Chunru Lin, Joshua B. Tenenbaum, Chuang Gan

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

Vision-language (VL) pretrained models have achieved impressive performance on multimodal reasoning and zero-shot recognition tasks. Many of these VL models are pretrained on unlabeled image and caption pairs from the internet. In this…

Computer Vision and Pattern Recognition · Computer Science 2023-05-30 Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun

Composition-Grounded Data Synthesis for Visual Reasoning

Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He

Learning Compositional Representations for Effective Low-Shot Generalization

We propose Recognition as Part Composition (RPC), an image encoding approach inspired by human cognition. It is based on the cognitive theory that humans recognize complex objects by components, and that they build a small compact…

Computer Vision and Pattern Recognition · Computer Science 2022-04-19 Samarth Mishra, Pengkai Zhu, Venkatesh Saligrama

Compositional Generalization with Grounded Language Models

Grounded language models use external sources of information, such as knowledge graphs, to meet some of the general challenges associated with pre-training. By extending previous work on compositional generalization in semantic parsing, we…

Computation and Language · Computer Science 2024-06-10 Sondre Wold, Étienne Simon, Lucas Georges Gabriel Charpentier, Egor V. Kostylev, Erik Velldal, Lilja Øvrelid

Learning to Compose and Reason with Language Tree Structures for Visual Grounding

Grounding natural language in images, such as localizing "the black dog on the left of the tree", is one of the core problems in artificial intelligence, as it needs to comprehend the fine-grained and compositional language space. However,…

Computer Vision and Pattern Recognition · Computer Science 2019-06-06 Richang Hong, Daqing Liu, Xiaoyu Mo, Xiangnan He, Hanwang Zhang

Towards Visual Grounding: A Survey

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science 2025-11-12 Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

Causal Graphical Models for Vision-Language Compositional Understanding

Recent work has empirically shown that Vision-Language Models (VLMs) struggle to fully understand the compositional properties of the human language, usually modeling an image caption as a "bag of words". As a result, they perform poorly on…

Computer Vision and Pattern Recognition · Computer Science 2025-04-16 Fiorenzo Parascandolo, Nicholas Moratelli, Enver Sangineto, Lorenzo Baraldi, Rita Cucchiara

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object…

Computer Vision and Pattern Recognition · Computer Science 2026-04-27 Lihao Zheng, Zhenwei Shao, Yu Zhou, Yan Yang, Xintian Shen, Jiawei Chen, Hao Ma, Tao Wei

Where and Who? Automatic Semantic-Aware Person Composition

Image compositing is a method used to generate realistic yet fake imagery by inserting contents from one image to another. Previous work in compositing has focused on improving appearance compatibility of a user selected foreground segment…

Graphics · Computer Science 2017-12-05 Fuwen Tan, Crispin Bernier, Benjamin Cohen, Vicente Ordonez, Connelly Barnes