Related papers: Grounded Situation Recognition

Grounded Situation Recognition with Transformers

Grounded Situation Recognition (GSR) is the task that not only classifies a salient action (verb), but also predicts entities (nouns) associated with semantic roles and their locations in the given image. Inspired by the remarkable success…

Computer Vision and Pattern Recognition · Computer Science · 2021-11-22 · Junhyeong Cho, Youngseok Yoon, Hyeonjun Lee, Suha Kwak

GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement

Grounded Situation Recognition (GSR) aims to generate structured semantic summaries of images for "human-like" event understanding. Specifically, GSR task not only detects the salient activity verb (e.g. buying), but also predicts all…

Computer Vision and Pattern Recognition · Computer Science · 2022-11-29 · Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, Alexander G. Hauptmann

Open Scene Understanding: Grounded Situation Recognition Meets Segment Anything for Helping People with Visual Impairments

Grounded Situation Recognition (GSR) is capable of recognizing and interpreting visual scenes in a contextually intuitive way, yielding salient activities (verbs) and the involved entities (roles) depicted in images. In this work, we focus…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-18 · Ruiping Liu, Jiaming Zhang, Kunyu Peng, Junwei Zheng, Ke Cao, Yufan Chen, Kailun Yang, Rainer Stiefelhagen

Rethinking the Two-Stage Framework for Grounded Situation Recognition

Grounded Situation Recognition (GSR), i.e., recognizing the salient activity (or verb) category in an image (e.g., buying) and detecting all corresponding semantic roles (e.g., agent and goods), is an essential step towards "human-like"…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-13 · Meng Wei, Long Chen, Wei Ji, Xiaoyu Yue, Tat-Seng Chua

GSR: Learning Structured Reasoning for Embodied Manipulation

Despite rapid progress, embodied agents still struggle with long-horizon manipulation that requires maintaining spatial consistency, causal dependencies, and goal constraints. A key limitation of existing approaches is that task reasoning…

Robotics · Computer Science · 2026-02-04 · Kewei Hu, Michael Zhang, Wei Ying, Tianhao Liu, Guoqiang Hao, Zimeng Li, Wanchan Yu, Jiajian Jing, Fangwen Chen, Hanwen Kang

Grounded Video Situation Recognition

Dense video understanding requires answering several questions such as who is doing what to whom, with what, how, why, and where. Recently, Video Situation Recognition (VidSitu) is framed as a task for structured prediction of multiple…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-21 · Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi

Seeing Beyond Classes: Zero-Shot Grounded Situation Recognition via Language Explainer

Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-25 · Jiaming Lei, Lin Li, Chunping Wang, Jun Xiao, Long Chen

Semantic Image Retrieval via Active Grounding of Visual Situations

We describe a novel architecture for semantic image retrieval, in particular retrieval of instances of visual situations. Visual situations are concepts such as "a boxing match," "walking the dog," "a crowd waiting for a bus," or "a game…

Computer Vision and Pattern Recognition · Computer Science · 2017-11-02 · Max H. Quinn, Erik Conser, Jordan M. Witte, Melanie Mitchell

Grid Spatial Understanding: A Dataset for Textual Spatial Reasoning over Grids, Embodied Settings, and Coordinate Structures

We introduce GSU, a text-only grid dataset to evaluate the spatial reasoning capabilities of LLMs over 3 core tasks: navigation, object localization, and structure composition. By forgoing visual inputs, isolating spatial reasoning from…

Computation and Language · Computer Science · 2026-03-19 · Risham Sidhu, Julia Hockenmaier

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation

Grounded Multimodal Named Entity Recognition (GMNER) task aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging attributes: 1) The tenuous correlation between images and…

Multimedia · Computer Science · 2025-09-03 · Jinyuan Li, Ziyan Li, Han Li, Jianfei Yu, Rui Xia, Di Sun, Gang Pan

Commonly Uncommon: Semantic Sparsity in Situation Recognition

Semantic sparsity is a common challenge in structured visual classification problems; when the output space is complex, the vast majority of the possible predictions are rarely, if ever, seen in the training set. This paper studies semantic…

Computer Vision and Pattern Recognition · Computer Science · 2016-12-06 · Mark Yatskar, Vicente Ordonez, Luke Zettlemoyer, Ali Farhadi

Beyond Referring Expressions: Scenario Comprehension Visual Grounding

Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-03 · Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez

GSR: A Generalized Symbolic Regression Approach

Identifying the mathematical relationships that best describe a dataset remains a very challenging problem in machine learning, and is known as Symbolic Regression (SR). In contrast to neural networks which are often treated as black boxes,…

Machine Learning · Computer Science · 2023-01-10 · Tony Tohme, Dehong Liu, Kamal Youcef-Toumi

CAGE: Context-Aware Grasping Engine

Semantic grasping is the problem of selecting stable grasps that are functionally suitable for specific object manipulation tasks. In order for robots to effectively perform object manipulation, a broad sense of contexts, including object…

Robotics · Computer Science · 2020-06-09 · Weiyu Liu, Angel Daruna, Sonia Chernova

Learning Cross-modal Context Graph for Visual Grounding

Visual grounding is a ubiquitous building block in many vision-language tasks and yet remains challenging due to large variations in visual and linguistic features of grounding entities, strong context effect and the resulting semantic…

Computer Vision and Pattern Recognition · Computer Science · 2019-11-26 · Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

Zero-Shot Grounding of Objects from Natural Language Queries

A phrase grounding system localizes a particular object in an image referred to by a natural language query. In previous work, the phrases were restricted to have nouns that were encountered in training; we extend the task to Zero-Shot…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-21 · Arka Sadhu, Kan Chen, Ram Nevatia

GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its immense potential for bridging the gap between vision and language modalities. However, advancements in this…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-16 · Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Textual grounding is an important but challenging task for human-computer interaction, robotics and knowledge mining. Existing algorithms generally formulate the task as selection from a set of bounding box proposals obtained from deep net…

Computer Vision and Pattern Recognition · Computer Science · 2018-04-02 · Raymond A. Yeh, Jinjun Xiong, Wen-mei W. Hwu, Minh N. Do, Alexander G. Schwing

Towards Visual Grounding: A Survey

Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-12 · Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan, Yaowei Wang, Changsheng Xu

LLMs as Bridges: Reformulating Grounded Multimodal Named Entity Recognition

Grounded Multimodal Named Entity Recognition (GMNER) is a nascent multimodal task that aims to identify named entities, entity types and their corresponding visual regions. GMNER task exhibits two challenging properties: 1) The weak…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-30 · Jinyuan Li, Han Li, Di Sun, Jiahao Wang, Wenkun Zhang, Zan Wang, Gang Pan