Related papers: MathConstruct: Challenging LLM Reasoning with Cons…

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics…

Artificial Intelligence · Computer Science 2025-10-21 Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr

ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining…

Computation and Language · Computer Science 2025-11-13 Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Minda Hu, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning…

Computation and Language · Computer Science 2026-05-20 Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs

The rapid advancement of large language models (LLMs) demands robust, unbiased, and scalable evaluation methods. However, human annotations are costly to scale, model-based evaluations are susceptible to stylistic biases, and…

Computation and Language · Computer Science 2025-03-21 Hailin Chen, Fangkai Jiao, Mathieu Ravaut, Nawshad Farruque, Xuan Phi Nguyen, Chengwei Qin, Manan Dey, Bosheng Ding, Caiming Xiong, Shafiq Joty, Yingbo Zhou

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise,…

Computation and Language · Computer Science 2026-05-12 Ivo Petrov, Jasper Dekoninck, Dimitar I. Dimitrov, Martin Vechev

MathScape: Benchmarking Multimodal Large Language Models in Real-World Mathematical Contexts

With the rapid progress of Multimodal LLMs, evaluating their mathematical reasoning capabilities has become an increasingly important research direction. In particular, visual-textual mathematical reasoning serves as a key indicator of an…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Hao Liang, Linzhuang Sun, Minxuan Zhou, Zirong Chen, Meiyi Qiang, Mingan Lin, Tianpeng Li, Fan Yang, Zenan Zhou, Wentao Zhang

Large Language Models Struggle with Unreasonability in Math Problems

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues,…

Computation and Language · Computer Science 2025-06-03 Jingyuan Ma, Damai Dai, Zihang Yuan, Rui Li, Weilin Luo, Bin Wang, Qun Liu, Lei Sha, Zhifang Sui

Large Language Models for Mathematical Reasoning: Progresses and Challenges

Mathematical reasoning serves as a cornerstone for assessing the fundamental cognitive capabilities of human intelligence. In recent times, there has been a notable surge in the development of Large Language Models (LLMs) geared towards the…

Computation and Language · Computer Science 2024-09-18 Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, Wenpeng Yin

Thinking Machines: Mathematical Reasoning in the Age of LLMs

Large Language Models (LLMs) have demonstrated impressive capabilities in structured reasoning and symbolic tasks, with coding emerging as a particularly successful application. This progress has naturally motivated efforts to extend these…

Artificial Intelligence · Computer Science 2026-02-02 Andrea Asperti, Alberto Naibo, Claudio Sacerdoti Coen

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness'…

Computation and Language · Computer Science 2025-11-10 Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip H. S. Torr, Cozmin Ududec, Luc Rocher, Adam Mahdi

Benchmarking Large Language Models for Math Reasoning Tasks

The use of Large Language Models (LLMs) in mathematical reasoning has become a cornerstone of related research, demonstrating the intelligence of these models and enabling potential practical applications through their advanced performance,…

Computation and Language · Computer Science 2024-12-20 Kathrin Seßler, Yao Rong, Emek Gözlüklü, Enkelejda Kasneci

LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches

Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly…

Computation and Language · Computer Science 2026-04-03 Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani

Evaluating Large Language Models for Real-World Engineering Tasks

Large Language Models (LLMs) are transformative not only for daily activities but also for engineering tasks. However, current evaluations of LLMs in engineering exhibit two critical shortcomings: (i) the reliance on simplified use cases,…

Artificial Intelligence · Computer Science 2025-05-21 Rene Heesch, Sebastian Eilermann, Alexander Windmann, Alexander Diedrich, Philipp Rosenthal, Oliver Niggemann

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Large language models (LLMs) demonstrate impressive capabilities in mathematical reasoning. However, despite these achievements, current evaluations are mostly limited to specific mathematical topics, and it remains unclear whether LLMs are…

Computation and Language · Computer Science 2025-04-01 Arash Gholami Davoodi, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

LLMs for Mathematical Modeling: Towards Bridging the Gap between Natural and Mathematical Languages

Large Language Models (LLMs) have demonstrated strong performance across various natural language processing tasks, yet their proficiency in mathematical reasoning remains a key challenge. Addressing the gap between natural and mathematical…

Artificial Intelligence · Computer Science 2025-02-18 Xuhan Huang, Qingning Shen, Yan Hu, Anningzhe Gao, Benyou Wang

SATBench: Benchmarking LLMs' Logical Reasoning via Automated Puzzle Generation from SAT Formulas

We introduce SATBench, a benchmark for evaluating the logical reasoning capabilities of large language models (LLMs) through logical puzzles derived from Boolean satisfiability (SAT) problems. Unlike prior work that focuses on inference…

Artificial Intelligence · Computer Science 2025-09-23 Anjiang Wei, Yuheng Wu, Yingjia Wan, Tarun Suresh, Huanmi Tan, Zhanke Zhou, Sanmi Koyejo, Ke Wang, Alex Aiken

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

Exceptional mathematical reasoning ability is one of the key features that demonstrate the power of large language models (LLMs). How to comprehensively define and evaluate the mathematical abilities of LLMs, and even reflect the user…

Computation and Language · Computer Science 2024-10-10 Zihao Zhou, Shudong Liu, Maizhen Ning, Wei Liu, Jindong Wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, Kaizhu Huang

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To…

Machine Learning · Computer Science 2025-02-14 Kaixuan Huang, Jiacheng Guo, Zihao Li, Xiang Ji, Jiawei Ge, Wenzhe Li, Yingqing Guo, Tianle Cai, Hui Yuan, Runzhe Wang, Yue Wu, Ming Yin, Shange Tang, Yangsibo Huang, Chi Jin, Xinyun Chen, Chiyuan Zhang, Mengdi Wang