Related papers: PredictaBoard: Benchmarking LLM Score Predictabili…

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents

Evaluating Large Language Models (LLMs) as general-purpose agents is essential for understanding their capabilities and facilitating their integration into practical applications. However, the evaluation process presents substantial…

Computation and Language · Computer Science 2024-12-25 Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, Junxian He

Large Language Models for Predictive Analysis: How Far Are They?

Predictive analysis is a cornerstone of modern decision-making, with applications in various domains. Large Language Models (LLMs) have emerged as powerful tools in enabling nuanced, knowledge-intensive conversations, thus aiding in complex…

Computation and Language · Computer Science 2025-05-26 Qin Chen, Yuanyi Ren, Xiaojun Ma, Yuyang Shi

JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically…

Artificial Intelligence · Computer Science 2025-11-21 Zhenyu Bi, Gaurav Srivastava, Yang Li, Meng Lu, Swastik Roy, Morteza Ziyadi, Xuan Wang

Scalable Delphi: Large Language Models for Structured Risk Estimation

Quantitative risk assessment in high-stakes domains relies on structured expert elicitation to estimate unobservable properties. The gold standard - the Delphi method - produces calibrated, auditable judgments but requires months of…

Artificial Intelligence · Computer Science 2026-02-10 Tobias Lorenz, Mario Fritz

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability

To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard…

Computation and Language · Computer Science 2024-12-25 Haonan Li, Xudong Han, Zenan Zhai, Honglin Mu, Hao Wang, Zhenxuan Zhang, Yilin Geng, Shom Lin, Renxi Wang, Artem Shelmanov, Xiangyu Qi, Yuxia Wang, Donghai Hong, Youliang Yuan, Meng Chen, Haoqin Tu, Fajri Koto, Tatsuki Kuribayashi, Cong Zeng, Rishabh Bhardwaj, Bingchen Zhao, Yawen Duan, Yi Liu, Emad A. Alghamdi, Yaodong Yang, Yinpeng Dong, Soujanya Poria, Pengfei Liu, Zhengzhong Liu, Xuguang Ren, Eduard Hovy, Iryna Gurevych, Preslav Nakov, Monojit Choudhury, Timothy Baldwin

Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge

The growing integration of Large Language Models (LLMs) into critical societal domains has raised concerns about embedded biases that can perpetuate stereotypes and undermine fairness. Such biases may stem from historical inequalities in…

Computation and Language · Computer Science 2025-10-17 Riccardo Cantini, Alessio Orsino, Massimo Ruggiero, Domenico Talia

Do LLMs estimate uncertainty well in instruction-following?

Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs' instruction-following…

Artificial Intelligence · Computer Science 2025-03-31 Juyeon Heo, Miao Xiong, Christina Heinze-Deml, Jaya Narain

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science 2025-06-04 Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

Large Language Models (LLMs) changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers that want…

Machine Learning · Computer Science 2025-08-26 Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco

How Uncertain Is the Grade? A Benchmark of Uncertainty Metrics for LLM-Based Automatic Assessment

The rapid rise of large language models (LLMs) is reshaping the landscape of automatic assessment in education. While these systems demonstrate substantial advantages in adaptability to diverse question types and flexibility in output…

Artificial Intelligence · Computer Science 2026-02-19 Hang Li, Kaiqi Yang, Xianxuan Long, Fedor Filippov, Yucheng Chu, Yasemin Copur-Gencturk, Peng He, Cory Miller, Namsoo Shin, Joseph Krajcik, Hui Liu, Jiliang Tang

ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

Although demonstrating remarkable performance on reasoning tasks, Large Language Models (LLMs) still tend to fabricate unreliable responses when confronted with problems that are unsolvable or beyond their capability, severely undermining…

Computation and Language · Computer Science 2025-11-13 Boyang Xue, Qi Zhu, Rui Wang, Sheng Wang, Hongru Wang, Minda Hu, Fei Mi, Yasheng Wang, Lifeng Shang, Qun Liu, Kam-Fai Wong

CreditAudit: 2nd Dimension for LLM Evaluation and Selection

Leaderboard scores on public benchmarks have been steadily rising and converging, with many frontier language models now separated by only marginal differences. However, these scores often fail to match users' day-to-day experience, because…

Artificial Intelligence · Computer Science 2026-02-05 Yiliang Song, Hongjun An, Jiangong Xiao, Haofei Zhao, Jiawei Shao, Xuelong Li

Risk Management for Mitigating Benchmark Failure Modes: BenchRisk

Large language model (LLM) benchmarks inform LLM use decisions (e.g., "is this LLM safe to deploy for my use case and context?"). However, benchmarks may be rendered unreliable by various failure modes that impact benchmark bias, variance,…

Software Engineering · Computer Science 2025-10-27 Sean McGregor, Victor Lu, Vassil Tashev, Armstrong Foundjem, Aishwarya Ramasethu, Sadegh AlMahdi Kazemi Zarkouei, Chris Knotz, Kongtao Chen, Alicia Parrish, Anka Reuel, Heather Frase

How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench

We investigate the predictability of large language model (LLM) capabilities: given records of past experiments using different model families, numbers of parameters, tasks, and numbers of in-context examples, can we accurately predict LLM…

Computation and Language · Computer Science 2023-11-01 Qinyuan Ye, Harvey Yiyun Fu, Xiang Ren, Robin Jia

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

Large Language Models (LLMs) have exhibited great performance in autonomously calling various tools in external environments, leading to better problem solving and task automation capabilities. However, these external tools also amplify…

Cryptography and Security · Computer Science 2025-09-10 Hongfei Xia, Hongru Wang, Zeming Liu, Qian Yu, Yuhang Guo, Haifeng Wang

Finding Blind Spots in Evaluator LLMs with Interpretable Checklists

Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and development decisions. However, concerns persist over the accuracy of these assessments and the potential…

Computation and Language · Computer Science 2024-11-27 Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M. Khapra

Overconfidence in LLM-as-a-Judge: Diagnosis and Confidence-Driven Solution

Large Language Models (LLMs) are widely used as automated judges, where practical value depends on both accuracy and trustworthy, risk-aware judgments. Existing approaches predominantly focus on accuracy, overlooking the necessity of…

Artificial Intelligence · Computer Science 2025-08-19 Zailong Tian, Zhuoheng Han, Yanzhe Chen, Haozhe Xu, Xi Yang, Richeng Xuan, Houfeng Wang, Lizi Liao

A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to…

Computation and Language · Computer Science 2026-03-17 Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann

Analyzing Uncertainty of LLM-as-a-Judge: Interval Evaluations with Conformal Prediction

LLM-as-a-judge has become a promising paradigm for using large language models (LLMs) to evaluate natural language generation (NLG), but the uncertainty of its evaluation remains underexplored. This lack of reliability may limit its…

Computation and Language · Computer Science 2025-09-24 Huanxin Sheng, Xinyi Liu, Hangfeng He, Jieyu Zhao, Jian Kang

JudgeBench: A Benchmark for Evaluating LLM-based Judges

LLM-based judges have emerged as a scalable alternative to human evaluation and are increasingly used to assess, compare, and improve models. However, the reliability of LLM-based judges themselves is rarely scrutinized. As LLMs become more…

Artificial Intelligence · Computer Science 2025-04-08 Sijun Tan, Siyuan Zhuang, Kyle Montgomery, William Y. Tang, Alejandro Cuadron, Chenguang Wang, Raluca Ada Popa, Ion Stoica