Related papers: Automatically Generating Hard Math Problems from H…

200 papers

There is growing interest in hypothesis generation with large language models (LLMs). However, fundamental questions remain: what makes a good hypothesis, and how can we systematically evaluate methods for hypothesis generation? To address…

Artificial Intelligence · Computer Science 2026-02-12 Haokun Liu, Sicong Huang, Jingyu Hu, Yangqiaoyu Zhou, Chenhao Tan

Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet demand for diverse and challenging math questions. Relying solely on human experts is both…

Artificial Intelligence · Computer Science 2025-02-04 Vedant Shah, Dingli Yu, Kaifeng Lyu, Simon Park, Jiatong Yu, Yinghui He, Nan Rosemary Ke, Michael Mozer, Yoshua Bengio, Sanjeev Arora, Anirudh Goyal

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are…

Advanced applied mathematics problems are underrepresented in existing Large Language Model (LLM) benchmark datasets. To address this, we introduce HARDMath, a dataset inspired by a graduate course on asymptotic methods, featuring…

Large language models (LLMs) have recently achieved remarkable success in generating rigorous mathematical proofs, with "AI for Math" emerging as a vibrant field of research (Ju et al., 2026). While these models have mastered…

Artificial Intelligence · Computer Science 2026-03-10 Lve Meng, Weilong Zhao, Yanzhi Zhang, Haoxiang Guan, Jiyan He

Excel is a pervasive yet often complex tool, particularly for novice users, where runtime errors arising from logical mistakes or misinterpretations of functions pose a significant challenge. While large language models (LLMs) offer…

Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation…

Computation and Language · Computer Science 2025-06-12 Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

With the growing popularity of Large Reasoning Models and their results in solving mathematical problems, it becomes crucial to measure their capabilities. We introduce a pipeline for both automatic and interactive verification as a more…

Artificial Intelligence · Computer Science 2026-02-25 Varvara Sazonova, Dmitri Shmelkin, Stanislav Kikot, Vasily Motolygin

While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem…

Artificial Intelligence · Computer Science 2025-10-02 Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev

Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing…

Computation and Language · Computer Science 2024-06-18 Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra

The mathematical capabilities of AI systems are complex and multifaceted. Most existing research has predominantly focused on the correctness of AI-generated solutions to mathematical problems. In this work, we argue that beyond producing…

Computation and Language · Computer Science 2024-10-25 Junyi Ye, Jingyi Gu, Xinyun Zhao, Wenpeng Yin, Guiling Wang

The demand for Large Language Models (LLMs) at multiple scales, capable of sophisticated and sound mathematical reasoning, continues to grow. However, the development of performant mathematical LLMs is often bottlenecked by the scarcity of…

Computation and Language · Computer Science 2025-11-05 Chaitanya Manem, Pratik Prabhanjan Brahma, Prakamya Mishra, Zicheng Liu, Emad Barsoum

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a…

Computation and Language · Computer Science 2026-05-27 Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

Materials discovery and design are essential for advancing technology across various industries by enabling the development of application-specific materials. Recent research has leveraged Large Language Models (LLMs) to accelerate this…

Computation and Language · Computer Science 2025-02-11 Shrinidhi Kumbhar, Venkatesh Mishra, Kevin Coutinho, Divij Handa, Ashif Iquebal, Chitta Baral

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated…

Computation and Language · Computer Science 2025-02-25 Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin

Personalized tutoring, teacher training, and education research need access to targeted synthetic misconceptions, but privacy and IRB constraints make labelled corpora of real student errors scarce. LLMs could in principle generate…

Computation and Language · Computer Science 2026-05-29 Xinming Yang, Jun Li

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions -- failing to capture the nature of mathematics…

Artificial Intelligence · Computer Science 2025-10-21 Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive…

Artificial Intelligence · Computer Science 2026-05-19 Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, Rongrong Ji

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky