Related papers: Calibrating LLM-Based Evaluator

200 papers

Large Language Models (LLMs) show promise for automated grading, but their outputs can be unreliable. Rather than improving grading accuracy directly, we address a complementary problem: predicting when an LLM grader is likely to be…

Computation and Language · Computer Science 2026-04-01 Robinson Ferrer, Damla Turgut, Zhongzhou Chen, Shashank Sonkar

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs…

Computation and Language · Computer Science 2025-02-10 Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy

In search settings, calibrating the scores during the ranking process to quantities such as click-through rates or relevance levels enhances a system's usefulness and trustworthiness for downstream users. While previous research has…

Information Retrieval · Computer Science 2024-08-28 Puxuan Yu, Daniel Cohen, Hemank Lamba, Joel Tetreault, Alex Jaimes

While large language models (LLMs) have been used for automated grading, they have not yet achieved the same level of performance as humans, especially when it comes to grading complex questions. Existing research on this topic focuses on a…

Artificial Intelligence · Computer Science 2024-05-31 Wenjing Xie, Juxin Niu, Chun Jason Xue, Nan Guan

With the advancements in Large Language Models (LLMs), Vision-Language Models (VLMs) have reached a new level of sophistication, showing notable competence in executing intricate cognition and reasoning tasks. However, existing evaluation…

Computer Vision and Pattern Recognition · Computer Science 2023-11-27 Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves…

Computation and Language · Computer Science 2025-11-26 Yixin Liu, Pengfei Liu, Arman Cohan

Previous work adopts large language models (LLMs) as evaluators to evaluate natural language processing (NLP) tasks. However, certain shortcomings, e.g., fairness, scope, and accuracy, persist for current LLM evaluators. To analyze whether…

Computation and Language · Computer Science 2025-01-22 Qintong Li, Leyang Cui, Lingpeng Kong, Wei Bi

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple…

Machine Learning · Computer Science 2026-02-10 Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete…

The rapid development of large language models (LLMs) has highlighted the need for efficient and reliable methods to evaluate their performance. Traditional evaluation methods often face challenges like high costs, limited task formats,…

Computation and Language · Computer Science 2025-11-11 Junjie Chen, Weihang Su, Zhumin Chu, Haitao Li, Yujia Zhou, Dingbo Yuan, Xudong Wang, Jun Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, Qingyao Ai

This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted…

Computation and Language · Computer Science 2025-01-03 Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, Chris Kedzie

Aligning large language models (LLMs) with human values is a vital task for LLM practitioners. Current alignment techniques have several limitations: (1) requiring a large amount of annotated data; (2) demanding heavy human involvement; (3)…

Computation and Language · Computer Science 2024-01-17 Hongyi Guo, Yuanshun Yao, Wei Shen, Jiaheng Wei, Xiaoying Zhang, Zhaoran Wang, Yang Liu

Large Language Models are cognitively biased judges. Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 15 LLMs of four…

Computation and Language · Computer Science 2024-09-26 Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, Dongyeop Kang

Large language models (LLMs) have emerged as a promising alternative to expensive human evaluations. However, the alignment and coverage of LLM-based evaluations are often limited by the scope and potential bias of the evaluation prompts…

Computation and Language · Computer Science 2024-02-27 Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

To enhance Large Language Models' (LLMs) reliability, calibration is essential -- the model's assessed confidence scores should align with the actual likelihood of its responses being correct. However, current confidence elicitation methods…

Computation and Language · Computer Science 2024-10-29 Yukun Huang, Yixin Liu, Raghuveer Thirukovalluru, Arman Cohan, Bhuwan Dhingra

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to existing benchmarks and metrics limitations. Employing…

Computation and Language · Computer Science 2024-02-14 Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

Large Language Models (LLMs) have demonstrated promising capabilities as automatic evaluators in assessing the quality of generated natural language. However, LLMs still exhibit biases in evaluation and often struggle to generate coherent…

Computation and Language · Computer Science 2025-01-20 Yinhong Liu, Han Zhou, Zhijiang Guo, Ehsan Shareghi, Ivan Vulić, Anna Korhonen, Nigel Collier

Large Language Models (LLMs) have demonstrated remarkable self-improvement capabilities, whereby models iteratively revise their outputs through self-generated feedback. While this reflective mechanism has shown promise in enhancing task…

Computation and Language · Computer Science 2025-04-07 Liangjie Huang, Dawei Li, Huan Liu, Lu Cheng

Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel…

Methodology · Statistics 2026-03-17 Jia Liu, Zhiyu Xu, Yuqi Gu

Large Language Models (LLMs) have demonstrated impressive performance across diverse domains, yet they still encounter challenges such as insufficient domain-specific knowledge, biases, and hallucinations. This underscores the need for…

Computation and Language · Computer Science 2025-04-07 Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas