Related papers: Automatically Generating Hard Math Problems from H…

HypoBench: Towards Systematic and Principled Benchmarking for Hypothesis Generation

There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address…

Artificial Intelligence · Computer Science 2026-02-12 Haokun Liu, Sicong Huang, Jingyu Hu, Yangqiaoyu Zhou, Chenhao Tan

AI-Assisted Generation of Difficult Math Questions

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both…

Artificial Intelligence · Computer Science 2025-02-04 Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are…

Computation and Language · Computer Science 2025-10-01 Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck, Jeremy Feusi, Tim Gehrunger, Raphael Appenzeller, Jim Bryan, Niklas Canova, Timo de Wolff, Filippo Gaia, Michel van Garrel, Baran Hashemi, David Holmes, Aitor Iribar Lopez, Victor Jaeck, Martina Jørgensen, Steven Kelk, Stefan Kuhlmann, Adam Kurpisz, Chiara Meroni, Ingmar Metzler, Martin Möller, Samuel Muñoz-Echániz, Robert Nowak, Georg Oberdieck, Daniel Platt, Dylan Possamaï, Gabriel Ribeiro, Raúl Sánchez Galán, Zheming Sun, Josef Teichmann, Richard P. Thomas, Charles Vial

HARDMath: A Benchmark Dataset for Challenging Problems in Applied Mathematics

Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring…

Machine Learning · Computer Science 2024-12-17 Jingxuan Fan, Sarah Martinson, Erik Y. Wang, Kaylie Hausknecht, Jonah Brenner, Danxian Liu, Nianli Peng, Corey Wang, Michael P. Brenner

Can a Lightweight Automated AI Pipeline Solve Research-Level Mathematical Problems?

Large language models (LLMs) have recently achieved remarkable success in generating rigorous mathematical proofs, with "AI for Math" emerging as a vibrant field of research (Ju et al., 2026). While these models have mastered…

Artificial Intelligence · Computer Science 2026-03-10 Lve Meng, Weilong Zhao, Yanzhi Zhang, Haoxiang Guan, Jiyan He

Benchmark Dataset Generation and Evaluation for Excel Formula Repair with LLMs

Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer…

Software Engineering · Computer Science 2025-08-19 Ananya Singha, Harshita Sahijwani, Walt Williams, Emmanuel Aboah Boateng, Nick Hausman, Miguel Di Luca, Keegan Choudhury, Chaya Binet, Vu Le, Tianwei Chen, Oryan Rokeah Chen, Sulaiman Vesal, Sadid Hasan

Synthesis by Design: Controlled Data Generation via Structural Guidance

Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation…

Computation and Language · Computer Science 2025-06-12 Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

Pipeline for Verifying LLM-Generated Mathematical Solutions

With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more…

Artificial Intelligence · Computer Science 2026-02-25 Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin

MathConstruct: Challenging LLM Reasoning with Constructive Proofs

While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem…

Artificial Intelligence · Computer Science 2025-10-02 Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev

Adversarial Math Word Problem Generation

Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing…

Computation and Language · Computer Science 2024-06-18 Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra

Assessing the Creativity of LLMs in Proposing Novel Solutions to Mathematical Problems

The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing…

Computation and Language · Computer Science 2024-10-25 Junyi Ye, Jingyi Gu, Xinyun Zhao, Wenpeng Yin, Guiling Wang

SAND-Math: Using LLMs to Generate Novel, Difficult and Useful Mathematics Questions and Answers

The demand for Large Language Models (LLMs) at multiple scales, capable of sophisticated and sound mathematical reasoning, continues to grow. However, the development of performant mathematical LLMs is often bottlenecked by the scarcity of…

Computation and Language · Computer Science 2025-11-05 Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a…

Computation and Language · Computer Science 2026-05-27 Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

Hypothesis Generation for Materials Discovery and Design Using Goal-Driven and Constraint-Guided LLM Agents

Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this…

Computation and Language · Computer Science 2025-02-11 Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral

AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated…

Computation and Language · Computer Science 2025-02-25 Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin

Error as a Lens: Probing LLM Reasoning through Synthetic Misconception Generation

Personalized tutoring, teacher training, and education research need access to targeted synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate…

Computation and Language · Computer Science 2026-05-29 Xinming Yang, Jun Li

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics…

Artificial Intelligence · Computer Science 2025-10-21 Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive…

Artificial Intelligence · Computer Science 2026-05-19 Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, Rongrong Ji

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky