AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

Shengda Fan; Xuyan Ye; Yupeng Huo; Zhi-Yuan Chen; Yiju Guo; Shenzhi Yang; Wenkai Yang; Shuqi Ye; Jingwen Chen; Haotian Chen; Xin Cong; Yankai Lin

Artificial Intelligence 2026-03-17 v1

Abstract

While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning, where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing between neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
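
Insight (1) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the numeric encoding of the three labels, the `StepLabel` and `correct_step_ratio` names, and the two toy trajectories are hypothetical and are not taken from the benchmark's data or code. The sketch only shows why a per-trajectory ratio of correct steps can look better for an agent that stops early than for one that explores and finishes.

```python
from enum import IntEnum
from typing import Sequence


class StepLabel(IntEnum):
    """Hypothetical encoding of a ternary step label (not the benchmark's own)."""
    ERRONEOUS = -1
    NEUTRAL = 0
    CORRECT = 1


def correct_step_ratio(labels: Sequence[StepLabel]) -> float:
    """Fraction of steps in one trajectory that are labeled CORRECT."""
    if not labels:
        return 0.0
    correct = sum(1 for label in labels if label == StepLabel.CORRECT)
    return correct / len(labels)


# Toy trajectory A: the agent takes two reasonable steps, then stops early
# without completing the task.
early_stop = [StepLabel.CORRECT, StepLabel.CORRECT]

# Toy trajectory B: the agent explores, makes one mistake, and keeps going.
long_run = [
    StepLabel.CORRECT,
    StepLabel.NEUTRAL,
    StepLabel.CORRECT,
    StepLabel.ERRONEOUS,
    StepLabel.CORRECT,
    StepLabel.CORRECT,
]

print(f"early stop: {correct_step_ratio(early_stop):.2f}")
print(f"long run:   {correct_step_ratio(long_run):.2f}")
```

In this toy case the early-stopping trajectory scores a higher ratio even though it does less work, which is why a step-level ratio is more informative when read alongside trajectory length and task outcome.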

Keywords

benchmarking, LLM agents, large language model, evaluation

Cite

@article{arxiv.2603.14465,
  title  = {AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents},
  author = {Shengda Fan and Xuyan Ye and Yupeng Huo and Zhi-Yuan Chen and Yiju Guo and Shenzhi Yang and Wenkai Yang and Shuqi Ye and Jingwen Chen and Haotian Chen and Xin Cong and Yankai Lin},
  journal= {arXiv preprint arXiv:2603.14465},
  year   = {2026}
}
