Related papers: Visual Grounding Methods for Efficient Interaction…

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

Image Difference Grounding with Natural Language

Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world…

Computer Vision and Pattern Recognition · Computer Science 2025-04-03 Wenxuan Wang, Zijia Zhao, Yisi Zhang, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu

R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding

Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar

Improved GUI Grounding via Iterative Narrowing

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Anthony Nguyen

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Beyond Clicking: A Step Towards Generalist GUI Grounding via Text Dragging

Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis to autonomous GUI agents. While existing grounding models achieve promising performance to simulate the mouse…

Human-Computer Interaction · Computer Science 2026-01-13 Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Awadallah

MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot…

Artificial Intelligence · Computer Science 2025-11-18 SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

Visual grounding (VG) aims at locating the foreground entities that match the given natural language expressions. Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally…

Computer Vision and Pattern Recognition · Computer Science 2024-05-27 Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu

GoClick: Lightweight Element Grounding Model for Autonomous GUI Interaction

Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Hongxin Li, Yuntao Chen, Zhaoxiang Zhang

Beyond Pixels: Introspective and Interactive Grounding for Visualization Agents

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive…

Computation and Language · Computer Science 2026-04-24 Yiyang Lu, Woong Shin, Ahmad Maroof Karimi, Feiyi Wang, Jie Ren, Evgenia Smirni

Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually…

Artificial Intelligence · Computer Science 2025-12-02 Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

You Only Look & Listen Once: Towards Fast and Accurate Visual Grounding

Visual Grounding (VG) aims to locate the most relevant region in an image, based on a flexible natural language query but not a pre-defined label, thus it can be a more useful technique than object detection in practice. Most…

Computer Vision and Pattern Recognition · Computer Science 2019-03-19 Chaorui Deng, Qi Wu, Guanghui Xu, Zhuliang Yu, Yanwu Xu, Kui Jia, Mingkui Tan

Object Detection for Graphical User Interface: Old Fashioned or Deep Learning or a Combination?

Detecting Graphical User Interface (GUI) elements in GUI images is a domain-specific object detection task. It supports many software engineering tasks, such as GUI animation and testing, GUI search and code generation. Existing studies for…

Computer Vision and Pattern Recognition · Computer Science 2020-09-08 Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, Guoqiang Li

How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper:…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Manni Duan

Interventional Video Grounding with Dual Contrastive Learning

Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies,…

Computer Vision and Pattern Recognition · Computer Science 2021-07-08 Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, Wei Lu

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be…

Human-Computer Interaction · Computer Science 2024-02-26 Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu

WinClick: GUI Grounding with Multimodal Large Language Models

Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing and user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key…

Computation and Language · Computer Science 2025-03-10 Zheng Hui, Yinheng Li, Dan Zhao, Tianyi Chen, Colby Banbury, Kazuhito Koishida

SwimVG: Step-wise Multimodal Fusion and Adaption for Visual Grounding

Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained…

Computer Vision and Pattern Recognition · Computer Science 2025-03-03 Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong

Vision-Grounded Machine Interpreting: Improving the Translation Process through Visual Cues

Machine Interpreting systems are currently implemented as unimodal, real-time speech-to-speech architectures, processing translation exclusively on the basis of the linguistic signal. Such reliance on a single modality, however, constrains…

Computation and Language · Computer Science 2025-09-30 Claudio Fantinuoli

AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu