Related papers: WebSuite: Systematically Evaluating Why Web Agents…

200 papers

Web agents automate browser tasks, ranging from simple form completion to complex workflows like ordering groceries. While current benchmarks evaluate general-purpose performance (e.g., WebArena) or safety against malicious actions (e.g.,…

Cryptography and Security · Computer Science 2026-04-09 Guruprasad Viswanathan Ramesh, Asmit Nayak, Basieem Siddique, Kassem Fawaz

Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First,…

Computation and Language · Computer Science 2026-04-22 Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans

Large language model-based web agents have demonstrated strong performance on realistic web interaction tasks. However, existing evaluations are predominantly conducted under relatively stable and well-behaved interaction conditions, which…

Software Engineering · Computer Science 2026-04-21 Haoyue Bai, Dong Wang, Long Chen, Bingguang Hao, Pengyang Shao, Yonghui Yang, Yicheng He, Chenyi Zhuang

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for…

Machine Learning · Computer Science 2025-02-26 George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These…

Autonomous web agents solve complex browsing tasks, yet existing benchmarks measure only whether an agent finishes a task, ignoring whether it does so safely or in a way enterprises can trust. To integrate these agents into critical…

Artificial Intelligence · Computer Science 2026-03-03 Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov

General virtual agents need to handle multimodal observations, master complex action spaces, and self-improve in dynamic, open-domain environments. However, existing environments are often domain-specific and require complex setups, which…

Artificial Intelligence · Computer Science 2025-02-17 Longtao Zheng, Zhiyuan Huang, Zhenghai Xue, Xinrun Wang, Bo An, Shuicheng Yan

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how…

Artificial Intelligence · Computer Science 2026-03-16 Orit Shahnovsky, Rotem Dror

Recent advances in browser-based LLM agents have shown promise for automating tasks ranging from simple form filling to hotel booking or online shopping. Current benchmarks measure agent performance in controlled environments, such as…

Artificial Intelligence · Computer Science 2025-10-07 Su Kara, Fazle Faisal, Suman Nath

While GUI agents have shown impressive capabilities in common computer-use tasks such as OSWorld, current benchmarks mainly focus on isolated and single-application tasks. This overlooks a critical real-world requirement of coordinating…

Artificial Intelligence · Computer Science 2026-05-01 Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, Min Zhang

WebAssembly (Wasm) has become a key compilation target for portable and efficient execution across diverse platforms. Benchmarking its performance, however, is a multi-dimensional challenge: it depends not only on the choice of runtime…

Performance · Computer Science 2026-02-06 Riccardo Carissimi, Ben L. Titzer

For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the…

Computation and Language · Computer Science 2024-07-17 Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, Zhengyang Wu

AI-powered web agents have the potential to automate repetitive tasks, such as form filling, information retrieval, and scheduling, but they struggle to reliably execute these tasks without human intervention, requiring users to provide…

Human-Computer Interaction · Computer Science 2026-01-27 Yimeng Liu, Misha Sra, Jeevana Priya Inala, Chenglong Wang

Browser agents enable autonomous web interaction but face critical reliability and security challenges in production. This paper presents findings from building and operating a production browser agent. The analysis examines where current…

Software Engineering · Computer Science 2025-11-26 Aram Vardanyan

To achieve successful assistance with long-horizon web-based tasks, AI agents must be able to sequentially follow real-world user instructions over a long period. Unlike existing web-based agent benchmarks, sequential instruction following…

Artificial Intelligence · Computer Science 2025-12-02 Suyu Ye, Haojun Shi, Darren Shih, Hyokun Yun, Tanya Roosta, Tianmin Shu

Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real world web use consists of long-horizon, multi-site workflows. Common web navigation tasks,…

Machine Learning · Computer Science 2026-04-29 Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

Autonomous agent systems powered by Large Language Models (LLMs) have demonstrated promising capabilities in automating complex tasks. However, current evaluations largely rely on success rates without systematically analyzing the…

Artificial Intelligence · Computer Science 2025-08-19 Ruofan Lu, Yichen Li, Yintong Huo

Multimodal AI agents are increasingly automating complex real-world workflows that involve online web execution. However, current web-agent benchmarks suffer from a critical limitation: they focus entirely on web-based interaction and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Shoubin Yu, Lei Shu, Antoine Yang, Yao Fu, Srinivas Sunkara, Maria Wang, Jindong Chen, Mohit Bansal, Boqing Gong

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke…

Artificial Intelligence · Computer Science 2026-05-26 Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

With advances in generative AI, there is now potential for autonomous agents to manage daily tasks via natural language commands. However, current agents are primarily created and tested in simplified synthetic environments, leading to a…

Artificial Intelligence · Computer Science 2024-04-17 Shuyan Zhou, Frank F. Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, Graham Neubig