Related papers: Decoding machine learning benchmarks

200 papers

The experiments covered by Machine Learning (ML) must consider two important aspects to assess the performance of a model: datasets and algorithms. Robust benchmarks are needed to evaluate the best classifiers. For this, one can adopt gold…

Benchmarking is a fundamental practice in machine learning (ML) for comparing the performance of classification algorithms. However, traditional evaluation methods often overlook a critical aspect: the joint consideration of dataset…

Machine Learning · Computer Science 2025-04-15 Lucas Cardoso, Vitor Santos, José Ribeiro, Regiane Kawasaki, Ricardo Prudêncio, Ronnie Alves

Evaluating large language models (LLMs) on comprehensive benchmarks is a cornerstone of their development, yet it's often computationally and financially prohibitive. While Item Response Theory (IRT) offers a promising path toward…

Artificial Intelligence · Computer Science 2025-10-07 Lele Liao, Qile Zhang, Ruofan Wu, Guanhua Fang

Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a…

Computers and Society · Computer Science 2024-11-06 Ziqi Xu, Sevvandi Kandanaarachchi, Cheng Soon Ong, Eirini Ntoutsi

Item Response Theory (IRT) aims to assess latent abilities of respondents based on the correctness of their answers in aptitude test items with different difficulty levels. In this paper, we propose the $\beta^3$-IRT model, which models…

Machine Learning · Statistics 2019-06-04 Yu Chen, Telmo Silva Filho, Ricardo B. C. Prudêncio, Tom Diethe, Peter Flach

Robust validation of Machine Learning (ML) models is essential, but traditional data partitioning approaches often ignore the intrinsic quality of each instance. This study proposes the use of Item Response Theory (IRT) parameters to…

Machine Learning · Computer Science 2025-08-15 Lucas Cardoso, Vitor Santos, José Ribeiro Filho, Ricardo Prudêncio, Regiane Kawasaki, Ronnie Alves

In this article, we propose a novel probabilistic framework to improve the accuracy of a weighted majority voting algorithm. In order to assign higher weights to the classifiers which can correctly classify hard-to-classify instances, we…

Machine Learning · Statistics 2019-11-13 Ziheng Chen, Hongshik Ahn

Model evaluation is a critical component in supervised machine learning classification analyses. Traditional metrics do not currently incorporate case difficulty. This renders the classification results unbenchmarked for generalization.…

Machine Learning · Computer Science 2023-02-10 Adrienne Kline, Joon Lee

Item Response Theory (IRT) has been proposed within the field of Educational Psychometrics to assess student ability as well as test question difficulty and discrimination power. More recently, IRT has been applied to evaluate machine…

Machine Learning · Statistics 2023-08-01 Sevvandi Kandanaarachchi, Kate Smith-Miles

Evaluation of large language models (LLMs) is increasingly critical, yet standard benchmarking methods rely on average accuracy, overlooking both the inherent stochasticity of LLM outputs and the heterogeneity of benchmark items. Item…

Machine Learning · Statistics 2026-05-11 Xinhao Qu, Qiang Heng, Hao Zeng, Xiaoqian Liu

Although fundamental to the advancement of Machine Learning, the classic evaluation metrics extracted from the confusion matrix, such as precision and F1, are limited. Such metrics only offer a quantitative view of the models' performance,…

The evaluation of large language models (LLMs) via benchmarks is widespread, yet inconsistencies between different leaderboards and poor separability among top models raise concerns about their ability to accurately reflect authentic model…

Computation and Language · Computer Science 2026-01-19 Hongli Zhou, Hui Huang, Ziqing Zhao, Lvyuan Han, Huicheng Wang, Kehai Chen, Muyun Yang, Wei Bao, Jian Dong, Bing Xu, Conghui Zhu, Hailong Cao, Tiejun Zhao

Evaluation of NLP methods requires testing against a previously vetted gold-standard test set and reporting standard metrics (accuracy/precision/recall/F1). The current assumption is that all items in a given test set are equal with regards…

Computation and Language · Computer Science 2016-09-26 John P. Lalor, Hao Wu, Hong Yu

Accuracy-based evaluation of Large Language Models (LLMs) measures benchmark-specific performance rather than underlying medical competency: it treats all questions as equally informative, conflates model ability with item characteristics,…

Computation and Language · Computer Science 2026-04-07 Zhimeng Luo, Lixin Wu, Adam Frisch, Daqing He

Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these are fit using parametric mixed effects models on the…

Machine Learning · Computer Science 2024-09-16 James Sharpnack, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey

Evaluating models and datasets in computer vision remains a challenging task, with most leaderboards relying solely on accuracy. While accuracy is a popular metric for model evaluation, it provides only a coarse assessment by considering a…

Computer Vision and Pattern Recognition · Computer Science 2024-09-09 Rahul Ramachandran, Tejal Kulkarni, Charchit Sharma, Deepak Vijaykeerthy, Vineeth N Balasubramanian

Item Response Theory (IRT) is a powerful statistical approach for evaluating test items and determining test taker abilities through response analysis. An IRT model that better fits the data leads to more accurate latent trait estimates. In…

Machine Learning · Statistics 2024-10-03 Joakim Wallmark, Maria Josefsson, Marie Wiberg

Large language models (LLMs) have demonstrated exceptional performance across a wide range of natural language tasks. However, selecting the optimal LLM to respond to a user query often necessitates a delicate balance between performance…

Artificial Intelligence · Computer Science 2025-06-24 Wei Song, Zhenya Huang, Cheng Cheng, Weibo Gao, Bihan Xu, GuanHao Zhao, Fei Wang, Runze Wu

Traditional methods for determining assessment item parameters, such as difficulty and discrimination, rely heavily on expensive field testing to collect student performance data for Item Response Theory (IRT) calibration. This study…

Computation and Language · Computer Science 2026-01-07 Christopher Ormerod

Evaluating large language models (LLMs) typically requires thousands of benchmark items, making the process expensive, slow, and increasingly impractical at scale. Existing evaluation protocols rely on average accuracy over fixed item sets,…

Computation and Language · Computer Science 2026-02-03 Peiyu Li, Xiuxiu Tang, Si Chen, Ying Cheng, Ronald Metoyer, Ting Hua, Nitesh V. Chawla