English
Related papers

Related papers: AppSelectBench: Application-Level Tool Selection B…

200 papers

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing…

Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. Recent work demonstrates that these API-based agents exhibit…

Software Engineering · Computer Science 2025-01-24 Haiyang Shen , Yue Li , Desong Meng , Dongqi Cai , Sheng Qi , Li Zhang , Mengwei Xu , Yun Ma

Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and…

Artificial Intelligence · Computer Science 2026-04-22 Daniel Shepard , Robin Salimans

AI text-to-app tools promise high quality applications and websites in minutes, yet no public benchmark rigorously verifies those claims. We introduce UI-Bench, the first large-scale benchmark that evaluates visual excellence across…

Computation and Language · Computer Science 2025-09-05 Sam Jung , Agustin Garcinuno , Spencer Mateega

Agents for computer use (ACUs) are an emerging class of systems capable of executing complex tasks on digital devices -- such as desktops, mobile phones, and web platforms -- given instructions in natural language. These agents can automate…

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou , Jiacheng Zhang , Haiyang Wang , Rui Hao , Jiahe Wang , Minghao Han , Yuxue Yang , Shuzhe Wu , Feiyang Pan , Lue Fan , Dandan Tu , Zhaoxiang Zhang

Computer-Using Agents (CUAs) aim to autonomously operate computer systems to complete real-world tasks. However, existing agentic systems remain difficult to scale and lag behind human performance. A key limitation is the absence of…

With the deep integration of artificial intelligence and interactive technology, Graphical User Interface (GUI) Agent, as the carrier connecting goal-oriented natural language and real-world devices, has received widespread attention from…

Artificial Intelligence · Computer Science 2025-11-13 Leyang Yang , Ziwei Wang , Xiaoxuan Tang , Sheng Zhou , Dajun Chen , Wei Jiang , Yong Li

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon…

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs'…

Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To…

Computation and Language · Computer Science 2025-10-20 Wei He , Yueqing Sun , Hongyan Hao , Xueyuan Hao , Zhikang Xia , Qi Gu , Chengcheng Han , Dengchang Zhao , Hui Su , Kefeng Zhang , Man Gao , Xi Su , Xiaodong Cai , Xunliang Cai , Yu Yang , Yunke Zhao

We introduce MMBench-GUI, a hierarchical benchmark for evaluating GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web platforms. It comprises four levels: GUI Content Understanding, Element Grounding, Task Automation,…

Existing online benchmarks for mobile GUI agents remain largely app-centric and task-homogeneous, failing to reflect the diversity and instability of real-world mobile usage. To this end, we introduce VenusBench-Mobile, a challenging online…

Human-Computer Interaction · Computer Science 2026-04-09 Yichen Gong , Zhuohan Cai , Sunhao Dai , Yuqi Zhou , Zhangxuan Gu , Changhua Meng , Shuheng Shen

Computer-use agents(CUAs)are moving frombounded benchmarks toward real software environments, wherethey operate browsers, desktops, mobile applications, flesystems,terminals, and tool backends. In such settings, reliability isno longer…

Computation and Language · Computer Science 2026-05-11 Zejian Chen , Zhanyuan Liu , Chaozhuo Li , Mengxiang Han , Songyang Liu , Litian Zhang , Feng Gao , Yiming Hei , Xi Zhang

While current Computer Use Agent (CUA) benchmarks measure task completion effectively, they provide limited assessment of enterprise deployment readiness, emphasizing functional correctness over the operational reliability required for…

Software Engineering · Computer Science 2025-11-24 Horia Cristescu , Charles Park , Trong Canh Nguyen , Sergiu Talmacel , Alexandru-Gabriel Ilie , Stefan Adam

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often…

Artificial Intelligence · Computer Science 2026-04-27 Bin Wu , Arastun Mammadli , Xiaoyu Zhang , Emine Yilmaz

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from…

Software Engineering · Computer Science 2026-05-19 Qingnan Ren , Shun Zou , Shiting Huang , Ziao Zhang , Kou Shi , Zhen Fang , Yiming Zhao , Yu Zeng , Qisheng Su , Lin Chen , Yong Wang , Zehui Chen , Xiangxiang Chu , Feng Zhao

Shortcuts such as APIs and deep-links have emerged as efficient complements to flexible GUI operations, fostering a promising hybrid paradigm for MLLM-based mobile automation. However, systematic evaluation of GUI-shortcut hybrid agents…

Building precise simulations of the real world and invoking numerical solvers to answer quantitative problems is an essential requirement in engineering and science. We present FEABench, a benchmark to evaluate the ability of large language…

Artificial Intelligence · Computer Science 2025-04-09 Nayantara Mudur , Hao Cui , Subhashini Venugopalan , Paul Raccuglia , Michael P. Brenner , Peter Norgaard
‹ Prev 1 2 3 10 Next ›