Estimating problem difficulty without ground truth using Large Language Model comparisons

Marthe Ballon; Andres Algaba; Brecht Verbeken; Vincent Ginis

Machine Learning 2025-12-17 v1 Artificial Intelligence

Abstract

Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation pipeline is a method for estimating problem difficulty. Current approaches, such as human calibration or performance-based scoring, fail to generalize to out-of-distribution problems, i.e. problems currently unsolvable by humans and LLMs, because they are time-consuming, do not scale, and depend on ground truth. We therefore propose a new method for estimating problem difficulty, LLM compare, that addresses these limitations: an LLM performs pairwise difficulty comparisons, and Bradley-Terry scores are then computed from the outcomes. To validate our method, we first propose a conceptual framework that positions existing approaches on three orthogonal planes (construction, scale and dependence), identifying which quadrants a measure needs to occupy to score out-of-distribution problems. LLM compare naturally occupies all desirable quadrants as the first measure that is continuous and dynamic, model-agnostic, and independent of ground truth information. Second, we show that LLM compare aligns strongly with human annotations: Pearson $r \geq 0.80$ for $n=1876$. Third, we show that LLM compare is robust to hallucinations, with less than $6\%$ degradation in Pearson correlation for $10\%$ noise injection. Our work represents a significant step towards replacing time-consuming human annotations and synthetic data generation, and will be an important driver for curriculum design, model evaluation, and AI-assisted research ideation.
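
To make the scoring step concrete, the following is a minimal sketch of how Bradley-Terry strengths could be fitted from pairwise difficulty judgements, and how a fraction of judgements could be flipped to mimic noise injection. It is not the authors' implementation: the function names `bradley_terry` and `flip_outcomes`, the MM (Zermelo) iteration, the small `prior` pseudo-count, and the toy comparison data are all assumptions made for illustration. Each comparison is encoded as a `(harder, easier)` pair of problem indices, as an LLM judge might return it.

```python
import random


def bradley_terry(n_items, comparisons, n_iter=500, tol=1e-10, prior=1e-3):
    """Fit Bradley-Terry strengths with the classic MM (Zermelo) iteration.

    comparisons: list of (harder, easier) index pairs.
    prior: small pseudo-count of wins per item (an assumption of this sketch)
           so that items that never "win" keep a strictly positive strength.
    Returns strengths normalised to sum to 1; a larger value means the
    problem was judged more difficult.
    """
    wins = [prior] * n_items
    pair_counts = {}
    for harder, easier in comparisons:
        wins[harder] += 1.0
        key = (min(harder, easier), max(harder, easier))
        pair_counts[key] = pair_counts.get(key, 0) + 1

    p = [1.0 / n_items] * n_items
    for _ in range(n_iter):
        # denom[i] = sum over opponents j of n_ij / (p_i + p_j)
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            d = n_ij / (p[i] + p[j])
            denom[i] += d
            denom[j] += d
        # Items that were never compared keep their current strength.
        new_p = [wins[i] / denom[i] if denom[i] > 0 else p[i]
                 for i in range(n_items)]
        total = sum(new_p)
        new_p = [x / total for x in new_p]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p


def flip_outcomes(comparisons, rate, rng):
    """Swap harder/easier in roughly a fraction `rate` of the comparisons."""
    return [(e, h) if rng.random() < rate else (h, e) for h, e in comparisons]


# Toy data (hypothetical): problem 2 is usually judged harder than 1,
# and problem 1 is usually judged harder than 0.
toy = [(2, 1)] * 3 + [(1, 2)] + [(1, 0)] * 3 + [(0, 1)] + [(2, 0)] * 4
clean_scores = bradley_terry(3, toy)
noisy_scores = bradley_terry(3, flip_outcomes(toy, 0.10, random.Random(0)))
print("clean:", clean_scores)
print("noisy:", noisy_scores)
```

Comparing `clean_scores` with `noisy_scores` (for example via a Pearson correlation over many problems) is one way to probe robustness to flipped judgements; the noise model and evaluation protocol in the paper itself may differ.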

Keywords

large language model evaluation large language model benchmark evaluation

Cite

@article{arxiv.2512.14220,
  title  = {Estimating problem difficulty without ground truth using Large Language Model comparisons},
  author = {Marthe Ballon and Andres Algaba and Brecht Verbeken and Vincent Ginis},
  journal= {arXiv preprint arXiv:2512.14220},
  year   = {2025}
}

Comments

19 pages, 10 figures
