Related papers: AgentProcessBench: Diagnosing Step-Level Process Q…

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks.…

Machine Learning · Computer Science 2025-10-07 Xing Han Lù , Amirhossein Kazemnejad , Nicholas Meade , Arkil Patel , Dongchan Shin , Alejandra Zambrano , Karolina Stańczak , Peter Shaw , Christopher J. Pal , Siva Reddy

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture…

Artificial Intelligence · Computer Science 2026-04-24 Keyu Li , Junhao Shi , Yang Xiao , Mohan Jiang , Jie Sun , Yunze Wu , Dayuan Fu , Shijie Xia , Xiaojie Cai , Tianze Xu , Weiye Si , Wenjie Li , Dequan Wang , Pengfei Liu

ToolPRMBench: Evaluating and Advancing Process Reward Models for Tool-using Agents

Reward-guided search methods have demonstrated strong potential in enhancing tool-using agents by effectively guiding sampling and exploration over complex action spaces. As a core design, those search methods utilize process reward models…

Artificial Intelligence · Computer Science 2026-01-21 Dawei Li , Yuguang Yao , Zhen Tan , Huan Liu , Ruocheng Guo

HumanStudy-Bench: Towards AI Agent Design for Participant Simulation

Large language models (LLMs) are increasingly used as simulated participants in social science experiments, but their behavior is often unstable and highly sensitive to design choices. Prior evaluations frequently conflate base-model…

Artificial Intelligence · Computer Science 2026-02-03 Xuan Liu , Haoyang Shang , Zizhang Liu , Xinyan Liu , Yunze Xiao , Yiwen Tu , Haojian Jin

Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents

As Multimodal Large Language Models (MLLMs) advance, multimodal agents show promise in real-world tasks like web navigation and embodied intelligence. However, due to a lack of external feedback, these agents struggle with…

Computation and Language · Computer Science 2025-06-27 Tianyi Men , Zhuoran Jin , Pengfei Cao , Yubo Chen , Kang Liu , Jun Zhao

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Evaluating Large Language Models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial…

Computation and Language · Computer Science 2024-12-25 Chang Ma , Junlei Zhang , Zhihao Zhu , Cheng Yang , Yujiu Yang , Yaohui Jin , Zhenzhong Lan , Lingpeng Kong , Junxian He

ACEBench: Who Wins the Match Point in Tool Usage?

Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs'…

Computation and Language · Computer Science 2025-11-21 Chen Chen , Xinlong Hao , Weiwen Liu , Xu Huang , Xingshan Zeng , Shuai Yu , Dexun Li , Shuai Wang , Weinan Gan , Yuefeng Huang , Wulong Liu , Xinzhi Wang , Defu Lian , Baoqun Yin , Yasheng Wang , Wu Liu

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks…

Artificial Intelligence · Computer Science 2026-05-14 Yu Li , Haoyu Luo , Yuejin Xie , Yuqian Fu , Zhonghao Yang , Shuai Shao , Qihan Ren , Wanying Qu , Yanwei Fu , Yujiu Yang , Jing Shao , Xia Hu , Dongrui Liu

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an…

Artificial Intelligence · Computer Science 2026-05-21 Zhengkang Guo , Yiyang Li , Lin Qiu , Xiaohua Wang , Jingwen Xv , Dongyu Ru , Xiaoyu Li , Xiaoqing Zheng , Xuezhi Cao , Xunliang Cai

Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench

Automated evaluation of tool-using large language model (LLM) agents is widely assumed to be reliable, but this assumption has rarely been validated against human annotation. We introduce AgentProp-Bench, a 2,000-task benchmark with 2,300…

Artificial Intelligence · Computer Science 2026-04-21 Bhaskar Gurram

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou , Jiacheng Zhang , Haiyang Wang , Rui Hao , Jiahe Wang , Minghao Han , Yuxue Yang , Shuzhe Wu , Feiyang Pan , Lue Fan , Dandan Tu , Zhaoxiang Zhang

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve…

Computation and Language · Computer Science 2026-01-29 Yiting Shen , Kun Li , Wei Zhou , Songlin Hu

Benchmark Test-Time Scaling of General LLM Agents

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating…

Artificial Intelligence · Computer Science 2026-02-24 Xiaochuan Li , Ryan Ming , Pranav Setlur , Abhijay Paladugu , Andy Tang , Hao Kang , Shuai Shao , Rong Jin , Chenyan Xiong

AgentQuest: A Modular Benchmark Framework to Measure Progress and Improve LLM Agents

The advances made by Large Language Models (LLMs) have led to the pursuit of LLM agents that can solve intricate, multi-step reasoning tasks. As with any research pursuit, benchmarking and evaluation are key cornerstones to efficient and…

Artificial Intelligence · Computer Science 2024-04-10 Luca Gioacchini , Giuseppe Siracusano , Davide Sanvito , Kiril Gashteovski , David Friede , Roberto Bifulco , Carolin Lawrence

VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To…

Computation and Language · Computer Science 2025-10-20 Wei He , Yueqing Sun , Hongyan Hao , Xueyuan Hao , Zhikang Xia , Qi Gu , Chengcheng Han , Dengchang Zhao , Hui Su , Kefeng Zhang , Man Gao , Xi Su , Xiaodong Cai , Xunliang Cai , Yu Yang , Yunke Zhao

TRAJECT-Bench: A Trajectory-Aware Benchmark for Evaluating Agentic Tool Use

Large language model (LLM)-based agents increasingly rely on tool use to complete real-world tasks. While existing works evaluate the LLMs' tool use capability, they largely focus on the final answers yet overlook the detailed tool usage…

Artificial Intelligence · Computer Science 2025-10-14 Pengfei He , Zhenwei Dai , Bing He , Hui Liu , Xianfeng Tang , Hanqing Lu , Juanhui Li , Jiayuan Ding , Subhabrata Mukherjee , Suhang Wang , Yue Xing , Jiliang Tang , Benoit Dumoulin

AgentBench: Evaluating LLMs as Agents

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Artificial Intelligence · Computer Science 2025-10-07 Xiao Liu , Hao Yu , Hanchen Zhang , Yifan Xu , Xuanyu Lei , Hanyu Lai , Yu Gu , Hangliang Ding , Kaiwen Men , Kejuan Yang , Shudan Zhang , Xiang Deng , Aohan Zeng , Zhengxiao Du , Chenhui Zhang , Sheng Shen , Tianjun Zhang , Yu Su , Huan Sun , Minlie Huang , Yuxiao Dong , Jie Tang

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce…

Artificial Intelligence · Computer Science 2025-11-04 Yunze Wu , Dayuan Fu , Weiye Si , Zhen Huang , Mohan Jiang , Keyu Li , Shijie Xia , Jie Sun , Tianze Xu , Xiangkun Hu , Pengrui Lu , Xiaojie Cai , Lyumanshan Ye , Wenhong Zhu , Yang Xiao , Pengfei Liu

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

Autonomous agents have rapidly matured as task executors and seen widespread deployment via harnesses such as OpenClaw. Safety concerns have rightly drawn growing research attention, and beneath them lie the values silently steering agent…

Artificial Intelligence · Computer Science 2026-05-12 Haonan Dong , Qiguan Feng , Kehan Jiang , Haoran Ye , Xin Zhang , Guojie Song

AgentRecBench: Benchmarking LLM Agent-based Personalized Recommender Systems

The emergence of agentic recommender systems powered by Large Language Models (LLMs) represents a paradigm shift in personalized recommendations, leveraging LLMs' advanced reasoning and role-playing capabilities to enable autonomous,…

Information Retrieval · Computer Science 2025-05-29 Yu Shang , Peijie Liu , Yuwei Yan , Zijing Wu , Leheng Sheng , Yuanqing Yu , Chumeng Jiang , An Zhang , Fengli Xu , Yu Wang , Min Zhang , Yong Li