Related papers: Quantifying Variance in Evaluation Benchmarks

200 papers

As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge that requires careful consideration of…

Artificial Intelligence · Computer Science 2024-08-01 Marco AF Pimentel, Clément Christophe, Tathagata Raha, Prateek Munjal, Praveen K Kanithi, Shadab Khan

The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very…

Computation and Language · Computer Science 2024-05-28 Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world…

Computation and Language · Computer Science 2025-09-05 Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

Multimodal large language models (MLLMs) have emerged as powerful tools for visual question answering (VQA), enabling reasoning and contextual understanding across visual and textual modalities. Despite their advancements, the evaluation of…

Computer Vision and Pattern Recognition · Computer Science 2025-08-05 Nikitha SR

The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis…

Computation and Language · Computer Science 2024-12-06 Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly)…

Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of…

Computation and Language · Computer Science 2025-09-29 Kimberly Le Truong, Riccardo Fogliato, Hoda Heidari, Zhiwei Steven Wu

Language model (LM) benchmarking faces several challenges: comprehensive evaluations are costly, benchmarks often fail to measure the intended capabilities, and evaluation quality can degrade due to labeling errors and benchmark saturation.…

Computation and Language · Computer Science 2025-09-16 Valentin Hofmann, David Heineman, Ian Magnusson, Kyle Lo, Jesse Dodge, Maarten Sap, Pang Wei Koh, Chun Wang, Hannaneh Hajishirzi, Noah A. Smith

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for…

Computation and Language · Computer Science 2023-11-06 Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Large language models (LLMs) are stochastic, and not all models give deterministic answers, even when setting temperature to zero with a fixed random seed. However, few benchmark studies attempt to quantify uncertainty, partly due to the…

Computation and Language · Computer Science 2025-06-30 Robert E. Blackwell, Jon Barry, Anthony G. Cohn

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset…

Machine Learning · Computer Science 2025-06-10 Guanhua Zhang, Florian E. Dorner, Moritz Hardt

Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance.…

Computation and Language · Computer Science 2024-06-07 Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail…

Computation and Language · Computer Science 2026-02-16 Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three…

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow…

Computation and Language · Computer Science 2026-02-20 Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific…

In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model…

Computation and Language · Computer Science 2025-08-22 Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang

Alignment of large language models (LLMs) typically involves training a reward model on preference data, followed by policy optimization with respect to the reward model. However, optimizing policies with respect to a single reward model…

Machine Learning · Computer Science 2025-07-23 Debangshu Banerjee, Kintan Saha, Aditya Gopalan

Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack understanding of how…

Computation and Language · Computer Science 2025-07-29 Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

Evaluations of large language models (LLMs) suffer from instability, where small changes of random factors such as few-shot examples can lead to drastic fluctuations of scores and even model rankings. Moreover, different LLMs can have…

Machine Learning · Computer Science 2025-09-17 Yiyang Li, Yonghuang Wu, Ying Luo, Liangtai Sun, Zishu Qin, Lin Qiu, Xuezhi Cao, Xunliang Cai