Related papers: Visual Grounding Methods for Efficient Interaction…

200 papers

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

Visual grounding (VG) typically focuses on locating regions of interest within an image using natural language, and most existing VG methods are limited to single-image interpretations. This limits their applicability in real-world…

Computer Vision and Pattern Recognition · Computer Science 2025-04-03 Wenxuan Wang, Zijia Zhao, Yisi Zhang, Yepeng Tang, Erdong Hu, Xinlong Wang, Jing Liu

Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Anthony Nguyen

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Graphical user interface (GUI) grounding, the process of mapping human instructions to GUI actions, serves as a fundamental basis to autonomous GUI agents. While existing grounding models achieve promising performance to simulate the mouse…

Human-Computer Interaction · Computer Science 2026-01-13 Zeyi Liao, Yadong Lu, Boyu Gou, Huan Sun, Ahmed Awadallah

Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot…

Artificial Intelligence · Computer Science 2025-11-18 SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

Visual grounding (VG) aims at locating the foreground entities that match the given natural language expressions. Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally…

Computer Vision and Pattern Recognition · Computer Science 2024-05-27 Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu

Graphical User Interface (GUI) element grounding (precisely locating elements on screenshots based on natural language instructions) is fundamental for agents interacting with GUIs. Deploying this capability directly on resource-constrained…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Hongxin Li, Yuntao Chen, Zhaoxiang Zhang

Vision-Language Models (VLMs) frequently misread values, hallucinate details, and confuse overlapping elements in charts. Current approaches rely solely on pixel interpretation, creating a Pixel-Only Bottleneck: agents treat interactive…

Computation and Language · Computer Science 2026-04-24 Yiyang Lu, Woong Shin, Ahmad Maroof Karimi, Feiyi Wang, Jie Ren, Evgenia Smirni

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually…

Artificial Intelligence · Computer Science 2025-12-02 Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

Visual Grounding (VG) aims to locate the most relevant region in an image, based on a flexible natural language query but not a pre-defined label, thus it can be a more useful technique than object detection in practice. Most…

Computer Vision and Pattern Recognition · Computer Science 2019-03-19 Chaorui Deng, Qi Wu, Guanghui Xu, Zhuliang Yu, Yanwu Xu, Kui Jia, Mingkui Tan

Detecting Graphical User Interface (GUI) elements in GUI images is a domain-specific object detection task. It supports many software engineering tasks, such as GUI animation and testing, GUI search and code generation. Existing studies for…

Computer Vision and Pattern Recognition · Computer Science 2020-09-08 Jieshan Chen, Mulong Xie, Zhenchang Xing, Chunyang Chen, Xiwei Xu, Liming Zhu, Guoqiang Li

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper:…

Computer Vision and Pattern Recognition · Computer Science 2025-09-16 Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Manni Duan

Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies,…

Computer Vision and Pattern Recognition · Computer Science 2021-07-08 Guoshun Nan, Rui Qiao, Yao Xiao, Jun Liu, Sicong Leng, Hao Zhang, Wei Lu

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be…

Human-Computer Interaction · Computer Science 2024-02-26 Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu

Graphical User Interface (GUI) tasks are vital for automating workflows such as software testing and user interface navigation. For users, the GUI is the most intuitive platform for interacting with a computer. Previous work identified a key…

Computation and Language · Computer Science 2025-03-10 Zheng Hui, Yinheng Li, Dan Zhao, Tianyi Chen, Colby Banbury, Kazuhito Koishida

Visual grounding aims to ground an image region through natural language, which heavily relies on cross-modal alignment. Most existing methods transfer visual/linguistic knowledge separately by fully fine-tuning uni-modal pre-trained…

Computer Vision and Pattern Recognition · Computer Science 2025-03-03 Liangtao Shi, Ting Liu, Xiantao Hu, Yue Hu, Quanjun Yin, Richang Hong

Machine Interpreting systems are currently implemented as unimodal, real-time speech-to-speech architectures, processing translation exclusively on the basis of the linguistic signal. Such reliance on a single modality, however, constrains…

Computation and Language · Computer Science 2025-09-30 Claudio Fantinuoli

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu