Related papers: Step-GUI Technical Report

200 papers

Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides…

Artificial Intelligence · Computer Science · 2025-09-30 · Cong Chen, Kaixiang Ji, Hao Zhong, Muzhi Zhu, Anzhou Li, Guo Gan, Ziyuan Huang, Cheng Zou, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-28 · Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability.…

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question…

Artificial Intelligence · Computer Science · 2025-09-03 · Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a…

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to…

Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should…

Artificial Intelligence · Computer Science · 2026-04-14 · Tiantian He, Yihang Chen, Keyue Jiang, Ka Yiu Lee, Kaiwen Zhou, Kun Shao, Shuai Wang

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation,…

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke…

Artificial Intelligence · Computer Science · 2026-05-26 · Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their…

Artificial Intelligence · Computer Science · 2025-10-22 · Ho Fai Leung, Xiaoyan Xi, Fei Zuo

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…

Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of…

Cryptography and Security · Computer Science · 2026-04-15 · Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are…

Human-Computer Interaction · Computer Science · 2026-05-20 · Felix Henry, Xiaochen Lin, Jiangyou Zhu, Yangfan, Bingqian Zhang, Min Chen, Shiyu Huang

Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-23 · Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B,…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-29 · Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet…

Machine Learning · Computer Science · 2026-04-14 · Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-01 · Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng

Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or…

Artificial Intelligence · Computer Science · 2026-05-26 · Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models…

Artificial Intelligence · Computer Science · 2024-06-14 · Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-09 · Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou