Related papers: WebSuite: Systematically Evaluating Why Web Agents…

WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g.,…

Cryptography and Security · Computer Science 2026-04-09 Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz

A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First,…

Computation and Language · Computer Science 2026-04-22 Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans

StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability

Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which…

Software Engineering · Computer Science 2026-04-21 Haoyue Bai, Dong Wang, Long Chen, Bingguang Hao, Pengyang Shao, Yonghui Yang, Yicheng He, Chenyi Zhuang

WebGames: Challenging General-Purpose Web-Browsing AI Agents

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for…

Machine Learning · Computer Science 2025-02-26 George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These…

Artificial Intelligence · Computer Science 2025-08-08 Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellermann, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents

Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical…

Artificial Intelligence · Computer Science 2026-03-03 Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov

AgentStudio: A Toolkit for Building General Virtual Agents

General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. However, existing environments are often domain-specific and require complex setups, which…

Artificial Intelligence · Computer Science 2025-02-17 Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan

AI Planning Framework for LLM-Based Web Agents

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how…

Artificial Intelligence · Computer Science 2026-03-16 Orit Shahnovsky, Rotem Dror

WAREX: Web Agent Reliability Evaluation on Existing Benchmarks

Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as…

Artificial Intelligence · Computer Science 2025-10-07 Su Kara, Fazle Faisal, Suman Nath

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating…

Artificial Intelligence · Computer Science 2026-05-01 Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang

Wasure: A Modular Toolkit for Comprehensive WebAssembly Benchmarking

WebAssembly (Wasm) has become a key compilation target for portable and efficient execution across diverse platforms. Benchmarking its performance, however, is a multi-dimensional challenge: it depends not only on the choice of runtime…

Performance · Computer Science 2026-02-06 Riccardo Carissimi, Ben L. Titzer

WebCanvas: Benchmarking Web Agents in Online Environments

For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the…

Computation and Language · Computer Science 2024-07-17 Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu

ReUseIt: Synthesizing Reusable AI Agent Workflows for Web Automation

AI-powered web agents have the potential to automate repetitive tasks, such as form filling, information retrieval, and scheduling, but they struggle to reliably execute these tasks without human intervention, requiring users to provide…

Human-Computer Interaction · Computer Science 2026-01-27 Yimeng Liu, Misha Sra, Jeevana Priya Inala, Chenglong Wang

Building Browser Agents: Architecture, Security, and Practical Solutions

Browser agents enable autonomous web interaction but face critical reliability and security challenges in production. This paper presents findings from building and operating a production browser agent. The analysis examines where current…

Software Engineering · Computer Science 2025-11-26 Aram Vardanyan

RealWebAssist: A Benchmark for Long-Horizon Web Assistance with Real-World Users

To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following…

Artificial Intelligence · Computer Science 2025-12-02 Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real-world web use consists of long-horizon, multi-site workflows. Common web navigation tasks,…

Machine Learning · Computer Science 2026-04-29 Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Exploring Autonomous Agents: A Closer Look at Why They Fail When Completing Tasks

Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the…

Artificial Intelligence · Computer Science 2025-08-19 Ruofan Lu, Yichen Li, Yintong Huo

Ego2Web: A Web Agent Benchmark Grounded in Egocentric Videos

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke…

Artificial Intelligence · Computer Science 2026-05-26 Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

WebArena: A Realistic Web Environment for Building Autonomous Agents

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a…

Artificial Intelligence · Computer Science 2024-04-17 Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig