Related papers: tinyBenchmarks: evaluating LLMs with fewer example…

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science 2025-06-04 Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Benchmark^2: Systematic Evaluation of LLM Benchmarks

The rapid proliferation of benchmarks for evaluating large language models (LLMs) has created an urgent need for systematic methods to assess benchmark quality itself. We propose Benchmark^2, a comprehensive framework comprising three…

Computation and Language · Computer Science 2026-01-08 Qi Qian, Chengsong Huang, Jingwen Xu, Changze Lv, Muling Wu, Wenhao Liu, Xiaohua Wang, Zhenghua Wang, Zisu Huang, Muzhao Tian, Jianhan Xu, Kun Hu, He-Da Wang, Yao Hu, Xuanjing Huang, Xiaoqing Zheng

When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards

Large Language Model (LLM) leaderboards based on benchmark rankings are regularly used to guide practitioners in model selection. Often, the published leaderboard rankings are taken at face value - we show this is a (potentially costly)…

Computation and Language · Computer Science 2024-07-04 Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan

Quantifying Variance in Evaluation Benchmarks

Evaluation benchmarks are the cornerstone of measuring capabilities of large language models (LLMs), as well as driving progress in said capabilities. Originally designed to make claims about capabilities (or lack thereof) in fully…

Machine Learning · Computer Science 2024-06-17 Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes

Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including…

Software Engineering · Computer Science 2025-11-05 Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo

On Robustness and Reliability of Benchmark-Based Evaluation of LLMs

The effectiveness of Large Language Models (LLMs) is usually evaluated by means of benchmarks such as MMLU, ARC-C, or HellaSwag, where questions are presented in their original wording, thus in a fixed, standardized format. However, real-world…

Computation and Language · Computer Science 2025-09-05 Riccardo Lunardi, Vincenzo Della Mea, Stefano Mizzaro, Kevin Roitero

Enterprise Large Language Model Evaluation Benchmark

Large Language Models (LLMs) have demonstrated promise in boosting productivity across AI-powered tools, yet existing benchmarks like Massive Multitask Language Understanding (MMLU) inadequately assess enterprise-specific task…

Artificial Intelligence · Computer Science 2025-06-26 Liya Wang, David Yi, Damien Jose, John Passarelli, James Gao, Jordan Leventis, Kang Li

How Benchmark Prediction from Fewer Data Misses the Mark

Large language model (LLM) evaluation is increasingly costly, prompting interest in methods that speed up evaluation by shrinking benchmark datasets. Benchmark prediction (also called efficient LLM evaluation) aims to select a small subset…

Machine Learning · Computer Science 2025-06-10 Guanhua Zhang, Florian E. Dorner, Moritz Hardt

Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific…

Computation and Language · Computer Science 2024-01-30 Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

Existing large language model (LLM) evaluation benchmarks primarily focus on English, while current multilingual tasks lack parallel questions that specifically assess cross-linguistic reasoning abilities. This dual limitation makes it…

Computation and Language · Computer Science 2025-05-27 Weihao Xuan, Rui Yang, Heli Qi, Qingcheng Zeng, Yunze Xiao, Aosong Feng, Dairui Liu, Yun Xing, Junjue Wang, Fan Gao, Jinghui Lu, Yuang Jiang, Huitao Li, Xin Li, Kunyu Yu, Ruihai Dong, Shangding Gu, Yuekang Li, Xiaofei Xie, Felix Juefei-Xu, Foutse Khomh, Osamu Yoshie, Qingyu Chen, Douglas Teodoro, Nan Liu, Randy Goebel, Lei Ma, Edison Marrese-Taylor, Shijian Lu, Yusuke Iwasawa, Yutaka Matsuo, Irene Li

Enterprise Benchmarks for Large Language Model Evaluation

The advancement of large language models (LLMs) has led to a greater challenge of having a rigorous and systematic evaluation of complex tasks performed, especially in enterprise applications. Therefore, LLMs need to be able to benchmark…

Computation and Language · Computer Science 2024-10-18 Bing Zhang, Mikio Takeuchi, Ryo Kawahara, Shubhi Asthana, Md. Maruf Hossain, Guang-Jie Ren, Kate Soule, Yada Zhu

Towards Multilingual LLM Evaluation for European Languages

The rise of Large Language Models (LLMs) has revolutionized natural language processing across numerous languages and tasks. However, evaluating LLM performance in a consistent and meaningful way across multiple European languages remains…

Computation and Language · Computer Science 2024-10-18 Klaudia Thellmann, Bernhard Stadler, Michael Fromm, Jasper Schulze Buschhoff, Alex Jude, Fabio Barth, Johannes Leveling, Nicolas Flores-Herr, Joachim Köhler, René Jäkel, Mehdi Ali

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Large Language Models in the Clinic: A Comprehensive Benchmark

The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical…

Computation and Language · Computer Science 2024-10-17 Fenglin Liu, Zheng Li, Hongjian Zhou, Qingyu Yin, Jingfeng Yang, Xianfeng Tang, Chen Luo, Ming Zeng, Haoming Jiang, Yifan Gao, Priyanka Nigam, Sreyashi Nag, Bing Yin, Yining Hua, Xuan Zhou, Omid Rohanian, Anshul Thakur, Lei Clifton, David A. Clifton

LIME: Less Is More for MLLM Evaluation

Multimodal Large Language Models (MLLMs) are evaluated on various benchmarks, such as image captioning, visual question answering, and reasoning. However, many of these benchmarks include overly simple or uninformative samples, complicating…

Computer Vision and Pattern Recognition · Computer Science 2024-10-15 King Zhu, Qianbo Zang, Shian Jia, Siwei Wu, Feiteng Fang, Yizhi Li, Shawn Gavin, Tuney Zheng, Jiawei Guo, Bo Li, Haoning Wu, Xingwei Qu, Jian Yang, Zachary Liu, Xiang Yue, J. H. Liu, Chenghua Lin, Min Yang, Shiwen Ni, Wenhao Huang, Ge Zhang

Do Large Language Model Benchmarks Test Reliability?

When deploying large language models (LLMs), it is important to ensure that these models are not only capable, but also reliable. Many benchmarks have been created to track LLMs' growing capabilities, however there has been no similar focus…

Machine Learning · Computer Science 2025-02-06 Joshua Vendrow, Edward Vendrow, Sara Beery, Aleksander Madry

Don't Make Your LLM an Evaluation Benchmark Cheater

Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for…

Computation and Language · Computer Science 2023-11-06 Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, Jiawei Han

IQ Test for LLMs: An Evaluation Framework for Uncovering Core Skills in LLMs

Current evaluations of large language models (LLMs) rely on benchmark scores, but it is difficult to interpret what these individual scores reveal about a model's overall skills. Specifically, as a community we lack understanding of how…

Computation and Language · Computer Science 2025-07-29 Aviya Maimon, Amir DN Cohen, Gal Vishne, Shauli Ravfogel, Reut Tsarfaty

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul, Hong Zhu, Ian Bayley

A Survey on Benchmarks of Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and…

Computation and Language · Computer Science 2024-09-09 Jian Li, Weiheng Lu, Hao Fei, Meng Luo, Ming Dai, Min Xia, Yizhang Jin, Zhenye Gan, Ding Qi, Chaoyou Fu, Ying Tai, Wankou Yang, Yabiao Wang, Chengjie Wang