Related papers: GUIDE: Graphical User Interface Data for Execution

EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data

Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike the large language model (LLM)-based methods which rely on structured texts and customized backends, the…

Artificial Intelligence · Computer Science 2024-11-05 Xuetian Chen, Hangcheng Li, Jiaqing Liang, Sihang Jiang, Deqing Yang

ReGUIDE: Data Efficient GUI Grounding via Spatial Reasoning and Search

Recent advances in Multimodal Large Language Models (MLLMs) have enabled autonomous agents to interact with computers via Graphical User Interfaces (GUIs), where accurately localizing the coordinates of interface elements (e.g., buttons) is…

Machine Learning · Computer Science 2025-05-27 Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, Kang Min Yoo

V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM

In the rapidly evolving landscape of AI research and application, Multimodal Large Language Models (MLLMs) have emerged as a transformative force, adept at interpreting and integrating information from diverse modalities such as text,…

Artificial Intelligence · Computer Science 2024-07-23 Abdur Rahman, Rajat Chawla, Muskaan Kumar, Arkajit Datta, Adarsh Jha, Mukunda NS, Ishaan Bhola

AUTONODE: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI Automation

In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and…

Artificial Intelligence · Computer Science 2024-05-28 Arkajit Datta, Tushar Verma, Rajat Chawla, Mukunda N. S, Ishaan Bhola

AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

Large Language Model (LLM) based agents have demonstrated proficiency in multi-step interactions with graphical user interfaces (GUIs). While most research focuses on improving single-task performance, practical scenarios often involve…

Artificial Intelligence · Computer Science 2026-05-21 Minghao Chen, Xinyi Hu, Zhou Yu, Yufei Yin

GUI-World: A Video Benchmark and Dataset for Multimodal GUI-oriented Understanding

Recently, Multimodal Large Language Models (MLLMs) have been used as agents to control keyboard and mouse inputs by directly perceiving the Graphical User Interface (GUI) and generating corresponding commands. However, current agents…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Dongping Chen, Yue Huang, Siyuan Wu, Jingyu Tang, Liuyi Chen, Yilin Bai, Zhigang He, Chenlong Wang, Huichi Zhou, Yiqiang Li, Tianshuo Zhou, Yue Yu, Chujie Gao, Qihui Zhang, Yi Gui, Zhen Li, Yao Wan, Pan Zhou, Jianfeng Gao, Lichao Sun

GUIDE: A Benchmark for Understanding and Assisting Users in Open-Ended GUI Tasks

Graphical User Interface (GUI) agents have the potential to assist users in interacting with complex software (e.g., PowerPoint, Photoshop). While prior research has primarily focused on automating user actions through clicks and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Saelyne Yang, Jaesang Yu, Yi-Hao Peng, Kevin Qinghong Lin, Jae Won Cho, Yale Song, Juho Kim

GraphPilot: GUI Task Automation with One-Step LLM Reasoning Powered by Knowledge Graph

Mobile graphical user interface (GUI) agents are designed to automate everyday tasks on smartphones. Recent advances in large language models (LLMs) have significantly enhanced the capabilities of mobile GUI agents. However, most…

Human-Computer Interaction · Computer Science 2026-01-27 Mingxian Yu, Siqi Luo, Xu Chen

GUI Agents with Foundation Models: A Comprehensive Survey

Recent advances in foundation models, particularly Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), have facilitated the development of intelligent agents capable of performing complex tasks. By leveraging the…

Artificial Intelligence · Computer Science 2025-02-14 Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, Bin Wang, Chuhan Wu, Yasheng Wang, Ruiming Tang, Jianye Hao

Breaking the Data Barrier -- Building GUI Agents Through Task Generalization

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of…

Artificial Intelligence · Computer Science 2025-04-16 Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs

The recent advancements introduced by Large Language Models (LLMs) have transformed how Artificial Intelligence (AI) can support complex, real world tasks, pushing research outside the text boundaries towards multi modal contexts and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Federico Toschi, Nicolò Brunello, Andrea Sassella, Vincenzo Scotti, Mark James Carman

A Deep User Interface for Exploring LLaMa

The growing popularity and widespread adoption of large language models (LLMs) necessitates the development of tools that enhance the effectiveness of user interactions with these models. Understanding the structures and functions of these…

Human-Computer Interaction · Computer Science 2025-03-03 Divya Perumal, Swaroop Panda

You Only Look at Screens: Multimodal Chain-of-Action Agents

Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. Recent studies have investigated eliciting the capabilities of large language models…

Computation and Language · Computer Science 2024-06-10 Zhuosheng Zhang, Aston Zhang

GraphicBench: A Planning Benchmark for Graphic Design with Language Agents

Large Language Model (LLM)-powered agents have unlocked new possibilities for automating human tasks. While prior work has focused on well-defined tasks with specified goals, the capabilities of agents in creative design tasks with…

Artificial Intelligence · Computer Science 2025-04-17 Dayeon Ki, Tianyi Zhou, Marine Carpuat, Gang Wu, Puneet Mathur, Viswanathan Swaminathan

ChartLlama: A Multimodal LLM for Chart Understanding and Generation

Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to…

Computer Vision and Pattern Recognition · Computer Science 2023-11-29 Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang

General Scene Adaptation for Vision-and-Language Navigation

Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner.…

Computer Vision and Pattern Recognition · Computer Science 2025-01-30 Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu

AppAgent v2: Advanced Agent for Flexible Mobile Interactions

With the advancement of Multimodal Large Language Models (MLLM), LLM-driven visual agents are increasingly impacting software interfaces, particularly those with graphical user interfaces. This work introduces a novel LLM-based multimodal…

Human-Computer Interaction · Computer Science 2025-09-18 Yanda Li, Chi Zhang, Wenjia Jiang, Wanqi Yang, Bin Fu, Pei Cheng, Xin Chen, Ling Chen, Yunchao Wei

GUI-Bee: Align GUI Action Grounding to Novel Environments via Autonomous Exploration

Graphical User Interface (GUI) action grounding is a critical step in GUI automation that maps language instructions to actionable elements on GUI screens. Most recent works of GUI action grounding leverage large GUI datasets to fine-tune…

Computation and Language · Computer Science 2025-01-28 Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, Gang Wu

MLaGA: Multimodal Large Language and Graph Assistant

Large Language Models (LLMs) have demonstrated substantial efficacy in advancing graph-structured data analysis. Prevailing LLM-based graph methods excel in adapting LLMs to text-rich graphs, wherein node attributes are text descriptions.…

Artificial Intelligence · Computer Science 2025-06-04 Dongzhe Fan, Yi Fang, Jiajin Liu, Djellel Difallah, Qiaoyu Tan

OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding

Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang