Related papers: Do Large Language Model Benchmarks Test Reliabilit…

The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance?

The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis…

Computation and Language · Computer Science 2024-12-06 Sourav Banerjee, Ayushi Agarwal, Eishkaran Singh

Don't Make Your LLM an Evaluation Benchmark Cheater

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for…

Computation and Language · Computer Science 2023-11-06 Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

Pitfalls of Evaluating Language Models with Open Benchmarks

Open Large Language Model (LLM) benchmarks, such as HELM and BIG-Bench, provide standardized and transparent evaluation protocols that support comparative analysis, reproducibility, and systematic progress tracking in Language Model (LM)…

Computation and Language · Computer Science 2026-01-08 Md. Najib Hasan, Md Mahadi Hassan Sibat, Mohammad Fakhruddin Babar, Souvika Sarkar, Monowar Hasan, Santu Karmaker

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world…

Computation and Language · Computer Science 2025-09-05 Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science 2025-06-04 Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive…

Computation and Language · Computer Science 2025-02-21 James Fodor

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While substantial works…

Computation and Language · Computer Science 2026-01-21 Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu

Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench

Recent advancements in Language Models (LMs) have catalyzed the creation of multiple benchmarks, designed to assess these models' general capabilities. A crucial task, however, is assessing the validity of the benchmarks themselves. This is…

Computation and Language · Computer Science 2024-09-13 Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen

Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including…

Software Engineering · Computer Science 2025-11-05 Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo

Are Large Language Models Memorizing Bug Benchmarks?

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world…

Software Engineering · Computer Science 2025-04-01 Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, Claire Le Goues

Measuring what Matters: Construct Validity in Large Language Model Benchmarks

Evaluating large language models (LLMs) is crucial for both assessing their capabilities and identifying safety or robustness issues prior to deployment. Reliably measuring abstract and complex phenomena such as 'safety' and 'robustness'…

Computation and Language · Computer Science 2025-11-10 Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, María Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, Hazel Kim, Hannah Rose Kirk, Fangru Lin, Gabrielle Kaili-May Liu, Lennart Luettgau, Jabez Magomere, Jonathan Rystrøm, Anna Sotnikova, Yushi Yang, Yilun Zhao, Adel Bibi, Antoine Bosselut, Ronald Clark, Arman Cohan, Jakob Foerster, Yarin Gal, Scott A. Hale, Inioluwa Deborah Raji, Christopher Summerfield, Philip H. S. Torr, Cozmin Ududec, Luc Rocher, Adam Mahdi

A Survey of Confidence Estimation and Calibration in Large Language Models

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their…

Computation and Language · Computer Science 2024-03-26 Jiahui Geng, Fengyu Cai, Yuxia Wang, Heinz Koeppl, Preslav Nakov, Iryna Gurevych

Medical Large Language Model Benchmarks Should Prioritize Construct Validity

Medical large language model (LLM) research often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks, a tradition inherited from…

Computation and Language · Computer Science 2025-03-17 Ahmed Alaa, Thomas Hartvigsen, Niloufar Golchini, Shiladitya Dutta, Frances Dean, Inioluwa Deborah Raji, Travis Zack

tinyBenchmarks: evaluating LLMs with fewer examples

The versatility of large language models (LLMs) led to the creation of diverse benchmarks that thoroughly test a variety of language models' abilities. These benchmarks consist of tens of thousands of examples making evaluation of LLMs very…

Computation and Language · Computer Science 2024-05-28 Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, Mikhail Yurochkin

A Survey on Large Language Model Benchmarks

In recent years, with the rapid development of the depth and breadth of large language models' capabilities, various corresponding evaluation benchmarks have been emerging in increasing numbers. As a quantitative assessment tool for model…

Computation and Language · Computer Science 2025-08-22 Shiwen Ni, Guhong Chen, Shuaimin Li, Xuanang Chen, Siyi Li, Bingli Wang, Qiyao Wang, Xingjian Wang, Yifan Zhang, Liyang Fan, Chengming Li, Ruifeng Xu, Le Sun, Min Yang

Benchmark^2: Systematic Evaluation of LLM Benchmarks

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three…

Computation and Language · Computer Science 2026-01-08 Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng

Multilingual European Language Models: Benchmarking Approaches and Challenges

The breakthrough of generative large language models (LLMs) that can solve different tasks through chat interaction has led to a significant increase in the use of general benchmarks to assess the quality or performance of these models…

Computation and Language · Computer Science 2025-04-03 Fabio Barth, Georg Rehm

Benchmarking Benchmark Leakage in Large Language Models

Amid the expanding use of pre-training data, the phenomenon of benchmark dataset leakage has become increasingly prominent, exacerbated by opaque training processes and the often undisclosed inclusion of supervised data in contemporary…

Computation and Language · Computer Science 2024-04-30 Ruijie Xu, Zengzhi Wang, Run-Ze Fan, Pengfei Liu

Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks -…

Computation and Language · Computer Science 2026-02-13 Eddie Yang, Dashun Wang

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Large Language Models (LLMs) have recently gained significant attention due to their remarkable capabilities in performing diverse tasks across various domains. However, a thorough evaluation of these models is crucial before deploying them…

Computation and Language · Computer Science 2024-10-04 Md Tahmid Rahman Laskar, Sawsan Alqahtani, M Saiful Bari, Mizanur Rahman, Mohammad Abdullah Matin Khan, Haidar Khan, Israt Jahan, Amran Bhuiyan, Chee Wei Tan, Md Rizwan Parvez, Enamul Hoque, Shafiq Joty, Jimmy Huang