Related papers: GUIDE: Graphical User Interface Data for Execution

200 papers

Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike the large language model (LLM)-based methods which rely on structured texts and customized backends, the…

Artificial Intelligence · Computer Science 2024-11-05 Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang

Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is…

Machine Learning · Computer Science 2025-05-27 Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo

In the rapidly evolving landscape of AI research and application, Multimodal Large Language Models (MLLMs) have emerged as a transformative force, adept at interpreting and integrating information from diverse modalities such as text,…

Artificial Intelligence · Computer Science 2024-07-23 Abdur Rahman, Rajat Chawla, Muskaan Kumar, Arkajit Datta, Adarsh Jha, Mukunda NS, Ishaan Bhola

In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and…

Artificial Intelligence · Computer Science 2024-05-28 Arkajit Datta, Tushar Verma, Rajat Chawla, Mukunda N. S, Ishaan Bhola

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve…

Artificial Intelligence · Computer Science 2026-05-21 Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim

Mobile graphical user interface (GUI) agents are designed to automate everyday tasks on smartphones. Recent advances in large language models (LLMs) have significantly enhanced the capabilities of mobile GUI agents. However, most…

Human-Computer Interaction · Computer Science 2026-01-27 Mingxian Yu, Siqi Luo, Xu Chen

Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), have facilitated the development of intelligent agents capable of performing complex tasks. By leveraging the…

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of…

Artificial Intelligence · Computer Science 2025-04-16 Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real-world tasks, pushing research outside the text boundaries towards multimodal contexts and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Federico Toschi, Nicolò Brunello, Andrea Sassella, Vincenzo Scotti, Mark James Carman

The growing popularity and widespread adoption of large language models (LLMs) necessitates the development of tools that enhance the effectiveness of user interactions with these models. Understanding the structures and functions of these…

Human-Computer Interaction · Computer Science 2025-03-03 Divya Perumal, Swaroop Panda

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models…

Computation and Language · Computer Science 2024-06-10 Zhuosheng Zhang, Aston Zhang

Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with…

Artificial Intelligence · Computer Science 2025-04-17 Dayeon Ki, Tianyi Zhou, Marine Carpuat, Gang Wu, Puneet Mathur, Viswanathan Swaminathan

Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to…

Computer Vision and Pattern Recognition · Computer Science 2023-11-29 Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang

Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner.…

Computer Vision and Pattern Recognition · Computer Science 2025-01-30 Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu

With the advancement of Multimodal Large Language Models (MLLM), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal…

Human-Computer Interaction · Computer Science 2025-09-18 Yanda Li, Chi Zhang, Wenjia Jiang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei

Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine-tune…

Computation and Language · Computer Science 2025-01-28 Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu

Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions.…

Artificial Intelligence · Computer Science 2025-06-04 Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang