Related papers: PrecisionCUA: Iterative Visual Refinement for Pixe…

Learning GUI Grounding with Spatial Reasoning from Visual Feedback

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

Improved GUI Grounding via Iterative Narrowing

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Anthony Nguyen

Phi-Ground Tech Report: Advancing Perception in GUI Grounding

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical…

Computer Vision and Pattern Recognition · Computer Science 2025-08-01 Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo

R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding

Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar

MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot…

Artificial Intelligence · Computer Science 2025-11-18 SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually…

Artificial Intelligence · Computer Science 2025-12-02 Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to…

Artificial Intelligence · Computer Science 2026-01-16 Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, Wu Liu

An Efficient Training Pipeline for Reasoning Graphical User Interface Agents

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work…

Artificial Intelligence · Computer Science 2025-11-17 Georgios Pantazopoulos, Eda B. Özyiğit

Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems

Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems which locate…

Artificial Intelligence · Computer Science 2025-03-11 Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang

Computer-Use Agents as Judges for Generative User Interface

Computer-Use Agents (CUAs) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

Aria-UI: Visual Grounding for GUI Instructions

Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to…

Human-Computer Interaction · Computer Science 2025-07-09 Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li

Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces

Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer…

Human-Computer Interaction · Computer Science 2025-07-21 El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane

Grounding Computer Use Agents on Human Demonstrations

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop…

Machine Learning · Computer Science 2025-11-11 Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with…

Machine Learning · Computer Science 2026-04-24 Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu, Shengyu Zhang

Training Computer Use Agents to Assess the Usability of Graphical User Interfaces

Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs) but doing so remains a costly and time-intensive process. Prior work has used computer…

Computation and Language · Computer Science 2026-04-30 Alice Gao, Weixi Tong, Rishab Vempati, Katharina Reinecke, R. Benjamin Shapiro, Tianyi Zhang, Jason Wu

AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu

ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with…

Artificial Intelligence · Computer Science 2026-05-13 Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye

Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be…

Human-Computer Interaction · Computer Science 2024-02-26 Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu