Related papers: Step-GUI Technical Report

GUI-Shepherd: Reliable Process Reward and Verification for Long-Sequence GUI Tasks

Autonomous agents for long-sequence Graphical User Interface tasks are hindered by sparse rewards and the intractable credit assignment problem. To address these challenges, we introduce GUI-Shepherd, a Process Reward Model that provides…

Artificial Intelligence · Computer Science 2025-09-30 Cong Chen, Kaixiang Ji, Hao Zhong, Muzhi Zhu, Anzhou Li, Guo Gan, Ziyuan Huang, Cheng Zou, Jiajia Liu, Jingdong Chen, Hao Chen, Chunhua Shen

AndroidDaily: A Verifiable Benchmark for Mobile GUI Agents on Real-World Closed-Source Applications

The rapid development of GUI foundation models and mobile GUI agents has spurred numerous evaluation benchmarks, yet most rely on simulated environments or open-source applications, leaving real-world closed-source applications largely…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Yifan Sui, Xin Huang, Hongbing Li, Fang Xu, Jiahe Lv, Haolong Yan, Yeqing Shen, Litao Liu, Zhimin Fan, Ziyang Meng, Jia Wang, Junbo Qi, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang, Osamu Yoshie

AgentCPM-GUI: Building Mobile-Use Agents with Reinforcement Fine-Tuning

The recent progress of large language model agents has opened new possibilities for automating tasks through graphical user interfaces (GUIs), especially in mobile environments where intelligent interaction can greatly enhance usability.…

Artificial Intelligence · Computer Science 2025-06-18 Zhong Zhang, Yaxi Lu, Yikun Fu, Yupeng Huo, Shenzhi Yang, Yesai Wu, Han Si, Xin Cong, Haotian Chen, Yankai Lin, Jie Xie, Wei Zhou, Wang Xu, Yuanheng Zhang, Zhou Su, Zhongwu Zhai, Xiaoming Liu, Yudong Mei, Jianming Xu, Hongyan Tian, Chongyi Wang, Chi Chen, Yuan Yao, Zhiyuan Liu, Maosong Sun

Mobile-Agent-v3: Fundamental Agents for GUI Automation

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question…

Artificial Intelligence · Computer Science 2025-09-03 Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan

UI-Venus-1.5 Technical Report

GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a…

Computer Vision and Pattern Recognition · Computer Science 2026-02-25 Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Tool calling has emerged as a critical capability for AI agents. In contrast to conventional tool calling frameworks that rely on static, provider-specific tool definitions, the Model Context Protocol (MCP) offers a unified interface to…

Computation and Language · Computer Science 2026-05-26 Ming Yin, Dinghan Shen, Silei Xu, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Jianbing Han, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

EE-MCP: Self-Evolving MCP-GUI Agents via Automated Environment Generation and Experience Learning

Computer-use agents that combine GUI interaction with structured API calls via the Model Context Protocol (MCP) show promise for automating software tasks. However, existing approaches lack a principled understanding of how agents should…

Artificial Intelligence · Computer Science 2026-04-14 Tiantian He, Yihang Chen, Keyue Jiang, Ka Yiu Lee, Kaiwen Zhou, Kun Shao, Shuai Wang

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-28 Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke…

Artificial Intelligence · Computer Science 2026-05-26 Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification

On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their…

Artificial Intelligence · Computer Science 2025-10-22 Ho Fai Leung, Xiaoyan Xi, Fei Zuo

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents

The paper introduces GUI-Owl-1.5, the latest native GUI agent model that features instruct/thinking variants in multiple sizes (2B/4B/8B/32B/235B) and supports a range of platforms (desktop, mobile, browser, and more) to enable cloud-edge…

Artificial Intelligence · Computer Science 2026-02-20 Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, Zhiyuan Chen, Jitong Liao, Qi Zheng, Jiahui Zeng, Ze Xu, Shuai Bai, Junyang Lin, Jingren Zhou, Ming Yan

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Recent years have witnessed a rapid development of mobile GUI agents powered by large language models (LLMs), which can autonomously execute diverse device-control tasks based on natural language instructions. The increasing accuracy of…

Cryptography and Security · Computer Science 2026-04-15 Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are…

Human-Computer Interaction · Computer Science 2026-05-20 Felix Henry, Xiaochen Lin, Jiangyou Zhu, Yangfan, Bingqian Zhang, Min Chen, Shiyu Huang

MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Despite the rapid progress of Multimodal Large Language Models (MLLMs), their ability to perform reliable visual grounding in high-stakes clinical software environments remains underexplored. Existing GUI benchmarks largely focus on…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Rozain Shakeel, Abdul Rahman Mohammad Ali, Muneeb Mushtaq, Tausifa Jan Saleem, Tajamul Ashraf

MAI-UI Technical Report: Real-World Centric Foundation GUI Agents

The development of GUI agents could revolutionize the next generation of human-computer interaction. Motivated by this vision, we present MAI-UI, a family of foundation GUI agents spanning the full spectrum of sizes, including 2B, 8B, 32B,…

Computer Vision and Pattern Recognition · Computer Science 2025-12-29 Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, Steven Hoi

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents

GUI agents drive applications through their visual interfaces instead of programmatic APIs, interacting with arbitrary software via taps, swipes, and keystrokes, reaching a long tail of applications that CLI-based agents cannot. Yet…

Machine Learning · Computer Science 2026-04-14 Fei Tang, Zhiqiong Lu, Boxuan Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

SecAgent: Efficient Mobile GUI Agent with Semantic Context

Mobile Graphical User Interface (GUI) agents powered by multimodal large language models have demonstrated promising capabilities in automating complex smartphone tasks. However, existing approaches face two critical limitations: the…

Computer Vision and Pattern Recognition · Computer Science 2026-04-01 Yiping Xie, Song Chen, Jingxuan Xing, Wei Jiang, Zekun Zhu, Yingyao Wang, Pi Bu, Jun Song, Yuning Jiang, Bo Zheng

SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking

Mobile GUI agents powered by large language models have progressed rapidly, creating urgent needs for realistic and comprehensive evaluation. Existing benchmarks prioritize reproducibility but are often limited to open-source apps or…

Artificial Intelligence · Computer Science 2026-05-26 Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction

The Graphical User Interface (GUI) is pivotal for human interaction with the digital world, enabling efficient device control and the completion of complex tasks. Recent progress in Large Language Models (LLMs) and Vision Language Models…

Artificial Intelligence · Computer Science 2024-06-14 Danyang Zhang, Zhennan Shen, Rui Xie, Situo Zhang, Tianbao Xie, Zihan Zhao, Siyuan Chen, Lu Chen, Hongshen Xu, Ruisheng Cao, Kai Yu

POINTS-GUI-G: GUI-Grounding Journey

The rapid advancement of vision-language models has catalyzed the emergence of GUI agents, which hold immense potential for automating complex tasks, from online shopping to flight booking, thereby alleviating the burden of repetitive…

Computer Vision and Pattern Recognition · Computer Science 2026-02-09 Zhongyin Zhao, Yuan Liu, Yikun Liu, Haicheng Wang, Le Tian, Xiao Zhou, Yangxiu You, Zilin Yu, Yang Yu, Jie Zhou