Related papers: Algorithmically Establishing Trust in Evaluators

A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability;…

Machine Learning · Statistics 2026-01-30 Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou

Reference-Free Rating of LLM Responses via Latent Information

How reliable are single-response LLM-as-a-judge ratings without references, and can we obtain fine-grained, deterministic scores in this setting? We study the common practice of asking a judge model to assign Likert-scale scores to…

Computation and Language · Computer Science 2025-09-30 Leander Girrbach, Chi-Ping Su, Tankred Saanum, Richard Socher, Eric Schulz, Zeynep Akata

Self-Taught Evaluators

Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human…

Computation and Language · Computer Science 2024-08-09 Tianlu Wang, Ilia Kulikov, Olga Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li

How to Correctly Report LLM-as-a-Judge Evaluations

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple…

Machine Learning · Computer Science 2026-02-10 Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are…

Computation and Language · Computer Science 2026-03-06 Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii

An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability

As large language models (LLMs) continue to advance, reliable evaluation methods are essential, particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its…

Computation and Language · Computer Science 2025-06-17 Yusuke Yamauchi, Taro Yano, Masafumi Oyamada

Evaluating LLM-Contaminated Crowdsourcing Data Without Ground Truth

The recent success of generative AI highlights the crucial role of high-quality human feedback in building trustworthy AI systems. However, the increasing use of large language models (LLMs) by crowdsourcing workers poses a significant…

Artificial Intelligence · Computer Science 2025-11-07 Yichi Zhang, Jinlong Pang, Zhaowei Zhu, Yang Liu

Ranking Large Language Models without Ground Truth

Evaluation and ranking of large language models (LLMs) has become an important problem with the proliferation of these models and their impact. Evaluation methods either require human responses which are expensive to acquire or use pairs of…

Computation and Language · Computer Science 2024-06-11 Amit Dhurandhar, Rahul Nair, Moninder Singh, Elizabeth Daly, Karthikeyan Natesan Ramamurthy

Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement

We present a principled approach to provide LLM-based evaluation with a rigorous guarantee of human agreement. We first propose that a reliable evaluation method should not uncritically rely on model preferences for pairwise evaluation, but…

Machine Learning · Computer Science 2024-07-29 Jaehun Jung, Faeze Brahman, Yejin Choi

LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency

Large language model (LLM) evaluation platforms increasingly rely on pairwise human judgments. These data are noisy, sparse, and non-uniform, yet leaderboards are reported with limited uncertainty quantification. We study this as…

Methodology · Statistics 2026-04-08 Jiachun Li, David Simchi-Levi, Will Wei Sun

Estimating the Accuracies of Multiple Classifiers Without Labeled Data

In various situations one is given only the predictions of multiple classifiers over a large unlabeled test data. This scenario raises the following questions: Without any labeled data and without any a-priori knowledge about the…

Machine Learning · Statistics 2014-10-31 Ariel Jaffe, Boaz Nadler, Yuval Kluger

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable…

Artificial Intelligence · Computer Science 2026-02-03 Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

Reinforcement Learning (RL) has been shown to substantially improve the reasoning capability of small and large language models (LLMs), but existing approaches typically rely on verifiable rewards, hence ground truth labels. We propose an…

Computation and Language · Computer Science 2026-04-06 Yiyang Shen, Lifu Tu, Weiran Wang

TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness

Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications. However, concerns have arisen regarding the trustworthiness of LLM outputs, particularly in…

Computation and Language · Computer Science 2024-05-08 Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan

LLM as a Meta-Judge: Synthetic Data for NLP Evaluation Metric Validation

Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs…

Computation and Language · Computer Science 2026-03-11 Lukáš Eigler, Jindřich Libovický, David Hurych

Large Language Models are Inconsistent and Biased Evaluators

The zero-shot capability of Large Language Models (LLMs) has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively…

Computation and Language · Computer Science 2024-05-06 Rickard Stureborg, Dimitris Alikaniotis, Yoshi Suhara

When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be…

Computation and Language · Computer Science 2026-04-01 Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar

Zero-Shot Grammar Competency Estimation Using Large Language Model Generated Pseudo Labels

Grammar competency estimation is essential for assessing linguistic proficiency in both written and spoken language; however, the spoken modality presents additional challenges due to its spontaneous, unstructured, and disfluent nature.…

Computation and Language · Computer Science 2025-11-18 Sourya Dipta Das, Shubham Kumar, Kuldeep Yadav

A-VERT: Agnostic Verification with Embedding Ranking Targets

The automatic evaluation of Language Model (LM) responses is a critical piece in the development of benchmarks and metrics, both for model training and quality assessment of production model endpoints. The current approaches to response…

Computation and Language · Computer Science 2025-10-03 Nicolás Aguirre, Ramiro Caso, Ramiro Rodríguez Colmeiro, Mauro Santelli, Joaquín Toranzo Calderón

EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta

Despite the remarkable coherence of Large Language Models (LLMs), existing evaluation methods often suffer from fluency bias and rely heavily on multiple-choice formats, making it difficult to assess factual accuracy and complex reasoning…

Computation and Language · Computer Science 2025-01-03 Raymond Bernard, Shaina Raza, Subhabrata Das, Rahul Murugan