Related papers: Decoding machine learning benchmarks

Data vs classifiers, who wins?

The experiments covered by Machine Learning (ML) must consider two important aspects to assess the performance of a model: datasets and algorithms. Robust benchmarks are needed to evaluate the best classifiers. For this, one can adopt gold…

Machine Learning · Computer Science 2021-11-03 Lucas F. F. Cardoso, Vitor C. A. Santos, Regiane S. Kawasaki Francês, Ricardo B. C. Prudêncio, Ronnie C. O. Alves

Enhancing Classifier Evaluation: A Fairer Benchmarking Strategy Based on Ability and Robustness

Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset…

Machine Learning · Computer Science 2025-04-15 Lucas Cardoso, Vitor Santos, José Ribeiro, Regiane Kawasaki, Ricardo Prudêncio, Ronnie Alves

Toward a unified framework for data-efficient evaluation of large language models

Evaluating large language models (LLMs) on comprehensive benchmarks is a cornerstone of their development, yet it's often computationally and financially prohibitive. While Item Response Theory (IRT) offers a promising path toward…

Artificial Intelligence · Computer Science 2025-10-07 Lele Liao, Qile Zhang, Ruofan Wu, Guanhua Fang

Fairness Evaluation with Item Response Theory

Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a…

Computers and Society · Computer Science 2024-11-06 Ziqi Xu, Sevvandi Kandanaarachchi, Cheng Soon Ong, Eirini Ntoutsi

$\beta^3$-IRT: A New Item Response Model and its Applications

Item Response Theory (IRT) aims to assess latent abilities of respondents based on the correctness of their answers in aptitude test items with different difficulty levels. In this paper, we propose the $\beta^3$-IRT model, which models…

Machine Learning · Statistics 2019-06-04 Yu Chen, Telmo Silva Filho, Ricardo B. C. Prudêncio, Tom Diethe, Peter Flach

Beyond Random Sampling: Instance Quality-Based Data Partitioning via Item Response Theory

Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to…

Machine Learning · Computer Science 2025-08-15 Lucas Cardoso, Vitor Santos, José Ribeiro Filho, Ricardo Prudêncio, Regiane Kawasaki, Ronnie Alves

Item Response Theory based Ensemble in Machine Learning

In this article, we propose a novel probabilistic framework to improve the accuracy of a weighted majority voting algorithm. In order to assign higher weights to the classifiers which can correctly classify hard-to-classify instances, we…

Machine Learning · Statistics 2019-11-13 Ziheng Chen, Hongshik Ahn

Machine Learning Capability: A standardized metric using case difficulty with applications to individualized deployment of supervised machine learning

Model evaluation is a critical component in supervised machine learning classification analyses. Traditional metrics do not currently incorporate case difficulty. This renders the classification results unbenchmarked for generalization.…

Machine Learning · Computer Science 2023-02-10 Adrienne Kline, Joon Lee

Comprehensive Algorithm Portfolio Evaluation using Item Response Theory

Item Response Theory (IRT) has been proposed within the field of Educational Psychometrics to assess student ability as well as test question difficulty and discrimination power. More recently, IRT has been applied to evaluate machine…

Machine Learning · Statistics 2023-08-01 Sevvandi Kandanaarachchi, Kate Smith-Miles

An Interpretable and Scalable Framework for Evaluating Large Language Models

Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item…

Machine Learning · Statistics 2026-05-11 Xinhao Qu, Qiang Heng, Hao Zeng, Xiaoqian Liu

Standing on the shoulders of giants

Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance,…

Machine Learning · Computer Science 2024-09-09 Lucas Felipe Ferraro Cardoso, José de Sousa Ribeiro Filho, Vitor Cirilo Araujo Santos, Regiane Silva Kawasaki Frances, Ronnie Cley de Oliveira Alves

Lost in Benchmarks? Rethinking Large Language Model Benchmarking with Item Response Theory

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model…

Computation and Language · Computer Science 2026-01-19 Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

Building an Evaluation Scale using Item Response Theory

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards…

Computation and Language · Computer Science 2016-09-26 John P. Lalor, Hao Wu, Hong Yu

Measuring Competency, Not Performance: Item-Aware Evaluation Across Medical Benchmarks

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics,…

Computation and Language · Computer Science 2026-04-07 Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these are fit using parametric mixed effects models on the…

Machine Learning · Computer Science 2024-09-16 James Sharpnack, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey

On Evaluation of Vision Datasets and Models using Human Competency Frameworks

Evaluating models and datasets in computer vision remains a challenging task, with most leaderboards relying solely on accuracy. While accuracy is a popular metric for model evaluation, it provides only a coarse assessment by considering a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Rahul Ramachandran, Tejal Kulkarni, Charchit Sharma, Deepak Vijaykeerthy, Vineeth N Balasubramanian

Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales

Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In…

Machine Learning · Statistics 2024-10-03 Joakim Wallmark, Maria Josefsson, Marie Wiberg

IRT-Router: Effective and Interpretable Multi-LLM Routing via Item Response Theory

Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language tasks. However, selecting the optimal LLM to respond to a user query often necessitates a delicate balance between performance…

Artificial Intelligence · Computer Science 2025-06-24 Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, Runze Wu

Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models

Traditional methods for determining assessment item parameters, such as difficulty and discrimination, rely heavily on expensive field testing to collect student performance data for Item Response Theory (IRT) calibration. This study…

Computation and Language · Computer Science 2026-01-07 Christopher Ormerod

Adaptive Testing for LLM Evaluation: A Psychometric Alternative to Static Benchmarks

Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets,…

Computation and Language · Computer Science 2026-02-03 Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla