Related papers: HumanRankEval: Automatic Evaluation of LMs as Conv…

HREF: Human Response-Guided Evaluation of Instruction Following in Language Models

Evaluating the capability of Large Language Models (LLMs) in following instructions has heavily relied on a powerful LLM as the judge, introducing unresolved biases that deviate the judgments from human judges. In this work, we reevaluate…

Computation and Language · Computer Science 2025-03-26 Xinxi Lyu, Yizhong Wang, Hannaneh Hajishirzi, Pradeep Dasigi

Bridging HCI and AI Research for the Evaluation of Conversational SE Assistants

As Large Language Models (LLMs) are increasingly adopted in software engineering, recently in the form of conversational assistants, ensuring these technologies align with developers' needs is essential. The limitations of traditional…

Software Engineering · Computer Science 2025-02-13 Jonan Richards, Mairieli Wessel

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation

Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use zero-shot setting without consulting…

Computation and Language · Computer Science 2025-04-11 Mingxuan Li, Hanchen Li, Chenhao Tan

HAL: Inducing Human-likeness in LLMs with Alignment

Conversational human-likeness plays a central role in human-AI interaction, yet it has remained difficult to define, measure, and optimize. As a result, improvements in human-like behavior are largely driven by scale or broad supervised…

Artificial Intelligence · Computer Science 2026-01-08 Masum Hasan, Junjie Zhao, Ehsan Hoque

Multi-modal, Multi-task, Multi-criteria Automatic Evaluation with Vision Language Models

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task,…

Computation and Language · Computer Science 2026-03-10 Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

HD-Eval: Aligning Large Language Model Evaluators Through Hierarchical Criteria Decomposition

Large language models (LLMs) have emerged as a promising alternative to expensive human evaluations. However, the alignment and coverage of LLM-based evaluations are often limited by the scope and potential bias of the evaluation prompts…

Computation and Language · Computer Science 2024-02-27 Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

LLMEval: A Preliminary Study on How to Evaluate Large Language Models

Recently, the evaluation of Large Language Models has emerged as a popular area of research. The three crucial questions for LLM evaluation are "what, where, and how to evaluate". However, the existing research mainly focuses on the first…

Artificial Intelligence · Computer Science 2023-12-19 Yue Zhang, Ming Zhang, Haipeng Yuan, Shichun Liu, Yongyao Shi, Tao Gui, Qi Zhang, Xuanjing Huang

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1,…

Computation and Language · Computer Science 2025-11-12 Sher Badshah, Hassan Sajjad

Calibrating LLM-Based Evaluator

Recent advancements in large language models (LLMs) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation.…

Computation and Language · Computer Science 2023-09-26 Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation

We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a…

Computation and Language · Computer Science 2026-02-11 Nalin Srun, Parisa Rastin, Guénaël Cabanes, Lydia Boudjeloud Assala

Aligning Black-box Language Models with Human Judgments

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs…

Computation and Language · Computer Science 2025-02-10 Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy

On Evaluating LLM Alignment by Evaluating LLMs as Judges

Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves…

Computation and Language · Computer Science 2025-11-26 Yixin Liu, Pengfei Liu, Arman Cohan

RepEval: Effective Text Evaluation with LLM Representation

The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text…

Computation and Language · Computer Science 2024-10-29 Shuqian Sheng, Yi Xu, Tianhang Zhang, Zanwei Shen, Luoyi Fu, Jiaxin Ding, Lei Zhou, Xiaoying Gan, Xinbing Wang, Chenghu Zhou

Automatic Evaluation of Generative Models with Instruction Tuning

Automatic evaluation of natural language generation has long been an elusive goal in NLP. A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion. Inspired by the…

Computation and Language · Computer Science 2023-11-01 Shuhaib Mehri, Vered Shwartz

TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs

Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation…

Computation and Language · Computer Science 2023-11-10 Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu

How Trustworthy Are LLM-as-Judge Ratings for Interpretive Responses? Implications for Qualitative Research Workflows

As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive…

Computation and Language · Computer Science 2026-04-02 Songhee Han, Jueun Shin, Jiyoon Han, Bung-Woo Jun, Hilal Ayan Karabatman

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and takes time given the large amounts of data. LLMs are increasingly…

Human-Computer Interaction · Computer Science 2025-08-07 Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Hyo Jin Do, Werner Geyer

Can Large Language Models Be an Alternative to Human Evaluations?

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable,…

Computation and Language · Computer Science 2023-05-04 Cheng-Han Chiang, Hung-yi Lee

A Comprehensive Analysis of the Effectiveness of Large Language Models as Automatic Dialogue Evaluators

Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique,…

Computation and Language · Computer Science 2024-01-23 Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li

Assessing the Performance of Human-Capable LLMs -- Are LLMs Coming for Your Job?

The current paper presents the development and validation of SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks. Given the…

Computers and Society · Computer Science 2024-10-23 John Mavi, Nathan Summers, Sergio Coronado