Related papers: ProcessBench: Identifying Process Errors in Mathem…

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during…

Computation and Language · Computer Science · 2025-07-01 · Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng

MPBench: A Comprehensive Multimodal Reasoning Benchmark for Process Errors Identification

Reasoning is an essential capacity for large language models (LLMs) to address complex tasks, where the identification of process errors is vital for improving this ability. Recently, process-level reward models (PRMs) were proposed to…

Artificial Intelligence · Computer Science · 2025-03-18 · Zhaopan Xu, Pengfei Zhou, Jiaxin Ai, Wangbo Zhao, Kai Wang, Xiaojiang Peng, Wenqi Shao, Hongxun Yao, Kaipeng Zhang

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and…

Artificial Intelligence · Computer Science · 2026-05-08 · Zhouhao Sun, Xuan Zhang, Xiao Ding, Bibo Cai, Li Du, Kai Xiong, Xinran Dai, Fei Zhang, weidi tang, Zhiyuan Kan, Yang Zhao, Bing Qin, Ting Liu

Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may…

Artificial Intelligence · Computer Science · 2025-05-30 · Xiang Li, Haiyang Yu, Xinghua Zhang, Ziyang Huang, Shizhu He, Kang Liu, Jun Zhao, Fei Huang, Yongbin Li

MedPRMBench: A Fine-grained Benchmark for Process Reward Models in Medical Reasoning

Process-Level Reward Models (PRMs) are essential for guiding complex reasoning in large language models, yet existing PRM benchmarks cover only general domains such as mathematics, failing to address medical reasoning -- which is uniquely…

Computation and Language · Computer Science · 2026-04-21 · Lingyan Wu, Xiang Zheng, Weiqi Zhai, Wei Wang, Xuan Ren, Zifan Zhang, Hu Wei, Bing Zhao

R-PRM: Reasoning-Driven Process Reward Modeling

Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically…

Computation and Language · Computer Science · 2025-03-28 · Shuaijie She, Junxiao Liu, Yifeng Liu, Jiajun Chen, Xin Huang, Shujian Huang

Error Typing for Smarter Rewards: Improving Process Reward Models with Error-Aware Hierarchical Supervision

Large Language Models (LLMs) are prone to hallucination, especially during multi-hop and reasoning-intensive tasks such as mathematical problem solving. While Outcome Reward Models verify only final answers, Process Reward Models (PRMs)…

Computation and Language · Computer Science · 2025-05-27 · Tej Deep Pala, Panshul Sharma, Amir Zadeh, Chuan Li, Soujanya Poria

ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure

Reasoning is central to a wide range of intellectual activities, and while the capabilities of large language models (LLMs) continue to advance, their performance in reasoning tasks remains limited. The processes and mechanisms underlying…

Artificial Intelligence · Computer Science · 2024-10-07 · Ippei Fujisawa, Sensho Nobe, Hiroki Seto, Rina Onda, Yoshiaki Uchida, Hiroki Ikoma, Pei-Chun Chien, Ryota Kanai

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction

The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving…

Computation and Language · Computer Science · 2024-06-05 · Xiaoyuan Li, Wenjie Wang, Moxin Li, Junrong Guo, Yang Zhang, Fuli Feng

The Lessons of Developing Process Reward Models in Mathematical Reasoning

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the…

Computation and Language · Computer Science · 2025-06-06 · Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin

MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification

To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with…

Computation and Language · Computer Science · 2024-07-03 · Kai Sun, Yushi Bai, Ji Qi, Lei Hou, Juanzi Li

IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are…

Computation and Language · Computer Science · 2025-10-01 · Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck, Jeremy Feusi, Tim Gehrunger, Raphael Appenzeller, Jim Bryan, Niklas Canova, Timo de Wolff, Filippo Gaia, Michel van Garrel, Baran Hashemi, David Holmes, Aitor Iribar Lopez, Victor Jaeck, Martina Jørgensen, Steven Kelk, Stefan Kuhlmann, Adam Kurpisz, Chiara Meroni, Ingmar Metzler, Martin Möller, Samuel Muñoz-Echániz, Robert Nowak, Georg Oberdieck, Daniel Platt, Dylan Possamaï, Gabriel Ribeiro, Raúl Sánchez Galán, Zheming Sun, Josef Teichmann, Richard P. Thomas, Charles Vial

MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning, often producing logically inconsistent or partially correct solutions.…

Artificial Intelligence · Computer Science · 2025-06-06 · Lingxiao Du, Fanqing Meng, Zongkai Liu, Zhixiang Zhou, Ping Luo, Qiaosheng Zhang, Wenqi Shao

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning

The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark…

Computation and Language · Computer Science · 2024-06-04 · Zicheng Lin, Zhibin Gou, Tian Liang, Ruilin Luo, Haowei Liu, Yujiu Yang

Is Mathematical Problem-Solving Expertise in Large Language Models Associated with Assessment Performance?

Large Language Models (LLMs) are increasingly used in math education not only as problem solvers but also as assessors of learners' reasoning. However, it remains unclear whether stronger math problem-solving ability is associated with…

Artificial Intelligence · Computer Science · 2026-03-27 · Liang Zhang, Yu Fu, Xinyi Jin

GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning

Recent advancements in Large Language Models (LLMs) have shown that it is promising to utilize Process Reward Models (PRMs) as verifiers to enhance the performance of LLMs. However, current PRMs face three key challenges: (1) limited…

Computation and Language · Computer Science · 2025-04-08 · Jian Zhao, Runze Liu, Kaiyan Zhang, Zhimu Zhou, Junqi Gao, Dong Li, Jiafei Lyu, Zhouyi Qian, Biqing Qi, Xiu Li, Bowen Zhou

BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models

Evaluating language models fairly is increasingly difficult as static benchmarks risk contamination by training data, obscuring whether models truly reason or recall. We introduce BeyondBench, an evaluation framework using algorithmic…

Computation and Language · Computer Science · 2026-03-06 · Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang

Improve Mathematical Reasoning in Language Models by Automated Process Supervision

Complex multi-step reasoning tasks, such as solving mathematical problems or generating code, remain a significant hurdle for even the most advanced large language models (LLMs). Verifying LLM outputs with an Outcome Reward Model (ORM) is a…

Computation and Language · Computer Science · 2024-12-13 · Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Meiqi Guo, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, Abhinav Rastogi

GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning

Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure.…

Computation and Language · Computer Science · 2025-08-08 · Jianghangfan Zhang, Yibo Yan, Kening Zheng, Xin Zou, Song Dai, Xuming Hu

ProJudge: A Multi-Modal Multi-Discipline Benchmark and Instruction-Tuning Dataset for MLLM-based Process Judges

As multi-modal large language models (MLLMs) frequently exhibit errors when solving scientific problems, evaluating the validity of their reasoning processes is critical for ensuring reliability and uncovering fine-grained model weaknesses.…

Artificial Intelligence · Computer Science · 2025-03-11 · Jiaxin Ai, Pengfei Zhou, Zhaopan Xu, Ming Li, Fanrui Zhang, Zizhen Li, Jianwen Sun, Yukang Feng, Baojin Huang, Zhongyuan Wang, Kaipeng Zhang