Related papers: Quantifying Variance in Evaluation Benchmarks

Beyond Metrics: A Critical Analysis of the Variability in Large Language Model Evaluation Frameworks

As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of…

Artificial Intelligence · Computer Science 2024-08-01 Marco AF Pimentel, Clément Christophe, Tathagata Raha, Prateek Munjal, Praveen K Kanithi, Shadab Khan

tinyBenchmarks: evaluating LLMs with fewer examples

The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very…

Computation and Language · Computer Science 2024-05-28 Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

Large Language Models (LLMs) effectiveness is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world…

Computation and Language · Computer Science 2025-09-05 Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

Evaluating Variance in Visual Question Answering Benchmarks

Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of…

Computer Vision and Pattern Recognition · Computer Science 2025-08-05 Nikitha SR

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis…

Computation and Language · Computer Science 2024-12-06 Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly)…

Computation and Language · Computer Science 2024-07-04 Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of…

Computation and Language · Computer Science 2025-09-29 Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu

Fluid Language Model Benchmarking

Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation.…

Computation and Language · Computer Science 2025-09-16 Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith

Don't Make Your LLM an Evaluation Benchmark Cheater

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for…

Computation and Language · Computer Science 2023-11-06 Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Towards Reproducible LLM Evaluation: Quantifying Uncertainty in LLM Benchmark Scores

Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the…

Computation and Language · Computer Science 2025-06-30 Robert E. Blackwell, Jon Barry, Anthony G. Cohn

How Benchmark Prediction from Fewer Data Misses the Mark

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset…

Machine Learning · Computer Science 2025-06-10 Guanhua Zhang, Florian E. Dorner, Moritz Hardt

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.…

Computation and Language · Computer Science 2024-06-07 Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail…

Computation and Language · Computer Science 2026-02-16 Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

Benchmark^2: Systematic Evaluation of LLM Benchmarks

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three…

Computation and Language · Computer Science 2026-01-08 Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng

Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow…

Computation and Language · Computer Science 2026-02-20 Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser

Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific…

Computation and Language · Computer Science 2024-01-30 Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert

A Survey on Large Language Model Benchmarks

In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model…

Computation and Language · Computer Science 2025-08-22 Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang

Towards Reliable, Uncertainty-Aware Alignment

Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model…

Machine Learning · Computer Science 2025-07-23 Debangshu Banerjee, Kintan Saha, Aditya Gopalan

IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack understanding of how…

Computation and Language · Computer Science 2025-07-29 Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

Instance-level Randomization: Toward More Stable LLM Evaluations

Evaluations of large language models (LLMs) suffer from instability, where small changes of random factors such as few-shot examples can lead to drastic fluctuations of scores and even model rankings. Moreover, different LLMs can have…

Machine Learning · Computer Science 2025-09-17 Yiyang Li, Yonghuang Wu, Ying Luo, Liangtai Sun, Zishu Qin, Lin Qiu, Xuezhi Cao, Xunliang Cai