Related papers: Calibrating LLM-Based Evaluator

When Can We Trust LLM Graders? Calibrating Confidence for Automated Assessment

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be…

Computation and Language · Computer Science · 2026-04-01 · Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar

Aligning Black-box Language Models with Human Judgments

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs…

Computation and Language · Computer Science · 2025-02-10 · Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy

Explain then Rank: Scale Calibration of Neural Rankers Using Natural Language Explanations from LLMs

In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system's usefulness and trustworthiness for downstream users. While previous research has…

Information Retrieval · Computer Science · 2024-08-28 · Puxuan Yu, Daniel Cohen, Hemank Lamba, Joel Tetreault, Alex Jaimes

Grade Like a Human: Rethinking Automated Assessment with Large Language Models

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a…

Artificial Intelligence · Computer Science · 2024-05-31 · Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-27 · Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

On Evaluating LLM Alignment by Evaluating LLMs as Judges

Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves…

Computation and Language · Computer Science · 2025-11-26 · Yixin Liu, Pengfei Liu, Arman Cohan

Exploring the Reliability of Large Language Models as Customized Evaluators for Diverse NLP Tasks

Previous work adopts large language models (LLMs) as evaluators to evaluate natural language processing (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether…

Computation and Language · Computer Science · 2025-01-22 · Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi

How to Correctly Report LLM-as-a-Judge Evaluations

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple…

Machine Learning · Computer Science · 2026-02-10 · Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Judging with Confidence: Calibrating Autoraters to Preference Distributions

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete…

Computation and Language · Computer Science · 2025-10-02 · Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi, Bo Dai, Dale Schuurmans, Paul Zhou, Hamid Palangi, Yiwen Song, Palash Goyal, Murat Kantarcioglu, Bradley A. Malin, Yuan Xue

Auto-PRE: An Automatic and Cost-Efficient Peer-Review Framework for Language Generation Evaluation

The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats,…

Computation and Language · Computer Science · 2025-11-11 · Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Yujia Zhou, Dingbo Yuan, Xudong Wang, Jun Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted…

Computation and Language · Computer Science · 2025-01-03 · Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie

Human-Instruction-Free LLM Self-Alignment with Limited Samples

Aligning large language models (LLMs) with human values is a vital task for LLM practitioners. Current alignment techniques have several limitations: (1) requiring a large amount of annotated data; (2) demanding heavy human involvement; (3)…

Computation and Language · Computer Science · 2024-01-17 · Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu

Benchmarking Cognitive Biases in Large Language Models as Evaluators

Large Language Models are cognitively biased judges. Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four…

Computation and Language · Computer Science · 2024-09-26 · Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang

HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition

Large language models (LLMs) have emerged as a promising alternative to expensive human evaluations. However, the alignment and coverage of LLM-based evaluations are often limited by the scope and potential bias of the evaluation prompts…

Computation and Language · Computer Science · 2024-02-27 · Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

Calibrating Long-form Generations from Large Language Models

To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods…

Computation and Language · Computer Science · 2024-10-29 · Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to existing benchmarks and metrics limitations. Employing…

Computation and Language · Computer Science · 2024-02-14 · Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators

Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent…

Computation and Language · Computer Science · 2025-01-20 · Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier

Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models

Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task…

Computation and Language · Computer Science · 2025-04-07 · Liangjie Huang, Dawei Li, Huan Liu, Lu Cheng

Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel…

Methodology · Statistics · 2026-03-17 · Jia Liu, Zhiyu Xu, Yuqi Gu

Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications

Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for…

Computation and Language · Computer Science · 2025-04-07 · Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas