Related papers: AppSelectBench: Application-Level Tool Selection B…

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing…

Artificial Intelligence · Computer Science 2026-05-26 Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. Recent work demonstrates that these API-based agents exhibit…

Software Engineering · Computer Science 2025-01-24 Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

AutomationBench

Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and…

Artificial Intelligence · Computer Science 2026-04-22 Daniel Shepard, Robin Salimans

UI-Bench: A Benchmark for Evaluating Design Capabilities of AI Text-to-App Tools

AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across…

Computation and Language · Computer Science 2025-09-05 Sam Jung, Agustin Garcinuno, Spencer Mateega

A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices -- such as desktops, mobile phones, and web platforms -- given instructions in natural language. These agents can automate…

Artificial Intelligence · Computer Science 2026-04-09 Pascal J. Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F. Grewe, Thilo Stadelmann

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou, Jiacheng Zhang, Haiyang Wang, Rui Hao, Jiahe Wang, Minghao Han, Yuxue Yang, Shuzhe Wu, Feiyang Pan, Lue Fan, Dandan Tu, Zhaoxiang Zhang

CUA-Skill: Develop Skills for Computer Using Agent

Computer-Using Agents (CUAs) aim to autonomously operate computer systems to complete real-world tasks. However, existing agentic systems remain difficult to scale and lag behind human performance. A key limitation is the absence of…

Artificial Intelligence · Computer Science 2026-02-04 Tianyi Chen, Yinheng Li, Michael Solodko, Sen Wang, Nan Jiang, Tingyuan Cui, Junheng Hao, Jongwoo Ko, Sara Abdali, Leon Xu, Suzhen Zheng, Hao Fan, Pashmina Cameron, Justin Wagle, Kazuhito Koishida

ProBench: Benchmarking GUI Agents with Accurate Process Information

With the deep integration of artificial intelligence and interactive technology, Graphical User Interface (GUI) Agent, as the carrier connecting goal-oriented natural language and real-world devices, has received widespread attention from…

Artificial Intelligence · Computer Science 2025-11-13 Leyang Yang, Ziwei Wang, Xiaoxuan Tang, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li

CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon…

Machine Learning · Computer Science 2026-03-31 Yi Yu, Guangquan Hu, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Junzhuo Ma, Weiting Liu, Jianfeng Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu

ACEBench: Who Wins the Match Point in Tool Usage?

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs'…

Computation and Language · Computer Science 2025-11-21 Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li, Junhao Shi, Yang Xiao, Mohan Jiang, Jie Sun, Yunze Wu, Dayuan Fu, Shijie Xia, Xiaojie Cai, Tianze Xu, Weiye Si, Wenjie Li, Dequan Wang, Pengfei Liu

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To…

Computation and Language · Computer Science 2025-10-20 Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-28 Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, Weiyun Wang, Xiangyu Zhao, Jixuan Chen, Haodong Duan, Tianbao Xie, Chenyu Yang, Shiqian Su, Yue Yu, Yuan Huang, Yiqian Liu, Xiao Zhang, Yanting Zhang, Xiangyu Yue, Weijie Su, Xizhou Zhu, Wei Shen, Jifeng Dai, Wenhai Wang

VenusBench-Mobile: A Challenging and User-Centric Benchmark for Mobile GUI Agents with Capability Diagnostics

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online…

Human-Computer Interaction · Computer Science 2026-04-09 Yichen Gong, Zhuohan Cai, Sunhao Dai, Yuqi Zhou, Zhangxuan Gu, Changhua Meng, Shuheng Shen

Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

Computer-use agents (CUAs) are moving from bounded benchmarks toward real software environments, where they operate browsers, desktops, mobile applications, filesystems, terminals, and tool backends. In such settings, reliability is no longer…

Computation and Language · Computer Science 2026-05-11 Zejian Chen, Zhanyuan Liu, Chaozhuo Li, Mengxiang Han, Songyang Liu, Litian Zhang, Feng Gao, Yiming Hei, Xi Zhang

UI-CUBE: Enterprise-Grade Computer Use Agent Benchmarking Beyond Task Accuracy to Operational Reliability

While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for…

Software Engineering · Computer Science 2025-11-24 Horia Cristescu, Charles Park, Trong Canh Nguyen, Sergiu Talmacel, Alexandru-Gabriel Ilie, Stefan Adam

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often…

Artificial Intelligence · Computer Science 2026-04-27 Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from…

Software Engineering · Computer Science 2026-05-19 Qingnan Ren, Shun Zou, Shiting Huang, Ziao Zhang, Kou Shi, Zhen Fang, Yiming Zhao, Yu Zeng, Qisheng Su, Lin Chen, Yong Wang, Zehui Chen, Xiangxiang Chu, Feng Zhao

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

Shortcuts such as APIs and deep-links have emerged as efficient complements to flexible GUI operations, fostering a promising hybrid paradigm for MLLM-based mobile automation. However, systematic evaluation of GUI-shortcut hybrid agents…

Artificial Intelligence · Computer Science 2026-04-16 Pengxiang Zhao, Guangyi Liu, YaoZhen Liang, Weiqing He, Zhengxi Lu, WenHao Wang, Yuehao Huang, Yuxiang Chai, Zhaolu Kang, Yaxuan Guo, Hao Wang, Kexin Zhang, Liang Liu, Yong Liu

FEABench: Evaluating Language Models on Multiphysics Reasoning Ability

Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language…

Artificial Intelligence · Computer Science 2025-04-09 Nayantara Mudur, Hao Cui, Subhashini Venugopalan, Paul Raccuglia, Michael P. Brenner, Peter Norgaard