Related papers: PrecisionCUA: Iterative Visual Refinement for Pixe…

200 papers

Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in…

Computer Vision and Pattern Recognition · Computer Science 2025-09-12 Anthony Nguyen

With the development of multimodal reasoning models, Computer Use Agents (CUAs), akin to Jarvis from "Iron Man", are becoming a reality. GUI grounding is a core component for CUAs to execute actual actions, similar to mechanical…

Computer Vision and Pattern Recognition · Computer Science 2025-08-01 Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo

Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the…

Computer Vision and Pattern Recognition · Computer Science 2025-07-09 Joonhyung Park, Peng Tang, Sagnik Das, Srikar Appalaraju, Kunwar Yashraj Singh, R. Manmatha, Shabnam Ghadar

Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot…

Artificial Intelligence · Computer Science 2025-11-18 SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually…

Artificial Intelligence · Computer Science 2025-12-02 Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

Computer-using agents (CUAs) act directly on graphical user interfaces, yet their perception of the screen is often unreliable. Existing work largely treats these failures as performance limitations, asking whether an action succeeds,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to…

Artificial Intelligence · Computer Science 2026-01-16 Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, Wu Liu

Visual grounding is the task of localising image regions from natural language queries and is critical for reasoning capable Graphical User Interface agents. Many existing methods rely on massive, noisy synthetic datasets. This work…

Artificial Intelligence · Computer Science 2025-11-17 Georgios Pantazopoulos, Eda B. Özyiğit

Humans can flexibly switch between different modes of thinking based on task complexity: from rapid intuitive judgments to in-depth analytical understanding. However, current Graphical User Interface (GUI) grounding systems which locate…

Artificial Intelligence · Computer Science 2025-03-11 Fei Tang, Yongliang Shen, Hang Zhang, Siqi Chen, Guiyang Hou, Wenqi Zhang, Wenqiao Zhang, Kaitao Song, Weiming Lu, Yueting Zhuang

Computer-Use Agents (CUAs) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUIs). Yet, most GUIs remain designed primarily for humans--prioritizing aesthetics and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-20 Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou

Digital agents for automating tasks across different platforms by directly manipulating the GUIs are increasingly important. For these agents, grounding from language instructions to target elements remains a significant challenge due to…

Human-Computer Interaction · Computer Science 2025-07-09 Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, Junnan Li

Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer…

Human-Computer Interaction · Computer Science 2025-07-21 El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop…

Graphical User Interface (GUI) grounding requires mapping natural language instructions to precise pixel coordinates. However, due to visually homogeneous elements and dense layouts, models typically grasp semantic intent yet struggle with…

Machine Learning · Computer Science 2026-04-24 Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu, Shengyu Zhang

Usability testing with experts and potential users can assess the effectiveness, efficiency, and user satisfaction of graphical user interfaces (GUIs) but doing so remains a costly and time-intensive process. Prior work has used computer…

Computation and Language · Computer Science 2026-04-30 Alice Gao, Weixi Tong, Rishab Vempati, Katharina Reinecke, R. Benjamin Shapiro, Tianyi Zhang, Jason Wu

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots…

Computer Vision and Pattern Recognition · Computer Science 2026-03-19 Siqi Pei, Liang Tang, Tiaonan Duan, Long Chen, Shuxian Li, Kaer Huang, Yanzhe Jing, Yiqiang Yan, Bo Zhang, Chenghao Jiang, Borui Zhang, Jiwen Lu

Computer Use Agents (CUAs) can act through both atomic GUI actions, such as click and type, and high-level tool calls, such as API-based file operations, but this hybrid action space often leaves them uncertain about when to continue with…

Artificial Intelligence · Computer Science 2026-05-13 Xuhao Hu, Xi Zhang, Haiyang Xu, Kyle Qiao, Jingyi Yang, Xuanjing Huang, Jing Shao, Ming Yan, Jieping Ye

Visual grounding is a task to locate the target indicated by a natural language expression. Existing methods extend the generic object detection framework to this problem. They base the visual grounding on the features from pre-generated…

Computer Vision and Pattern Recognition · Computer Science 2022-06-09 Li Yang, Yan Xu, Chunfeng Yuan, Wei Liu, Bing Li, Weiming Hu

Graphical User Interface (GUI) agents are designed to automate complex tasks on digital devices, such as smartphones and desktops. Most existing GUI agents interact with the environment through extracted structured data, which can be…

Human-Computer Interaction · Computer Science 2024-02-26 Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, Zhiyong Wu