Related papers: HumanRankEval: Automatic Evaluation of LMs as Conv…

200 papers

Evaluating the capability of Large Language Models (LLMs) in following instructions has heavily relied on a powerful LLM as the judge, introducing unresolved biases that deviate the judgments from human judges. In this work, we reevaluate…

Computation and Language · Computer Science 2025-03-26 Xinxi Lyu, Yizhong Wang, Hannaneh Hajishirzi, Pradeep Dasigi

As Large Language Models (LLMs) are increasingly adopted in software engineering, recently in the form of conversational assistants, ensuring these technologies align with developers' needs is essential. The limitations of traditional…

Software Engineering · Computer Science 2025-02-13 Jonan Richards, Mairieli Wessel

Large language models (LLMs) have demonstrated great potential for automating the evaluation of natural language generation. Previous frameworks of LLM-as-a-judge fall short in two ways: they either use zero-shot setting without consulting…

Computation and Language · Computer Science 2025-04-11 Mingxuan Li, Hanchen Li, Chenhao Tan

Conversational human-likeness plays a central role in human-AI interaction, yet it has remained difficult to define, measure, and optimize. As a result, improvements in human-like behavior are largely driven by scale or broad supervised…

Artificial Intelligence · Computer Science 2026-01-08 Masum Hasan, Junjie Zhao, Ehsan Hoque

Vision-language models (VLMs) have shown impressive abilities across a range of multi-modal tasks. However, existing metrics for evaluating the quality of text generated by VLMs typically focus on an overall evaluation for a specific task,…

Computation and Language · Computer Science 2026-03-10 Masanari Ohi, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

Large language models (LLMs) have emerged as a promising alternative to expensive human evaluations. However, the alignment and coverage of LLM-based evaluations are often limited by the scope and potential bias of the evaluation prompts…

Computation and Language · Computer Science 2024-02-27 Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

Recently, the evaluation of Large Language Models has emerged as a popular area of research. The three crucial questions for LLM evaluation are "what, where, and how to evaluate". However, the existing research mainly focuses on the first…

Artificial Intelligence · Computer Science 2023-12-19 Yue Zhang, Ming Zhang, Haipeng Yuan, Shichun Liu, Yongyao Shi, Tao Gui, Qi Zhang, Xuanjing Huang

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1,…

Computation and Language · Computer Science 2025-11-12 Sher Badshah, Hassan Sajjad

Recent advancements in large language models (LLMs) on language modeling and emergent capabilities make them a promising reference-free evaluator of natural language generation quality, and a competent alternative to human evaluation.…

Computation and Language · Computer Science 2023-09-26 Yuxuan Liu, Tianchi Yang, Shaohan Huang, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang

We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a…

Computation and Language · Computer Science 2026-02-11 Nalin Srun, Parisa Rastin, Guénaël Cabanes, Lydia Boudjeloud Assala

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs…

Computation and Language · Computer Science 2025-02-10 Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy

Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves…

Computation and Language · Computer Science 2025-11-26 Yixin Liu, Pengfei Liu, Arman Cohan

The era of Large Language Models (LLMs) raises new demands for automatic evaluation metrics, which should be adaptable to various application scenarios while maintaining low cost and effectiveness. Traditional metrics for automatic text…

Computation and Language · Computer Science 2024-10-29 Shuqian Sheng, Yi Xu, Tianhang Zhang, Zanwei Shen, Luoyi Fu, Jiaxin Ding, Lei Zhou, Xiaoying Gan, Xinbing Wang, Chenghu Zhou

Automatic evaluation of natural language generation has long been an elusive goal in NLP. A recent paradigm fine-tunes pre-trained language models to emulate human judgements for a particular task and evaluation criterion. Inspired by the…

Computation and Language · Computer Science 2023-11-01 Shuhaib Mehri, Vered Shwartz

Large language models (LLMs) have shown impressive capabilities across various natural language tasks. However, evaluating their alignment with human preferences remains a challenge. To this end, we propose a comprehensive human evaluation…

Computation and Language · Computer Science 2023-11-10 Shuyi Xie, Wenlin Yao, Yong Dai, Shaobo Wang, Donlin Zhou, Lifeng Jin, Xinhua Feng, Pengzhi Wei, Yujie Lin, Zhichao Hu, Dong Yu, Zhengyou Zhang, Jing Nie, Yuhong Liu

As qualitative researchers show growing interest in using automated tools to support interpretive analysis, a large language model (LLM) is often introduced into an analytic workflow as is, without systematic evaluation of interpretive…

Computation and Language · Computer Science 2026-04-02 Songhee Han, Jueun Shin, Jiyoon Han, Bung-Woo Jun, Hilal Ayan Karabatman

Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and takes time given the large amounts of data. LLMs are increasingly…

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable,…

Computation and Language · Computer Science 2023-05-04 Cheng-Han Chiang, Hung-yi Lee

Automatic evaluation is an integral aspect of dialogue system research. The traditional reference-based NLG metrics are generally found to be unsuitable for dialogue assessment. Consequently, recent studies have suggested various unique,…

Computation and Language · Computer Science 2024-01-23 Chen Zhang, Luis Fernando D'Haro, Yiming Chen, Malu Zhang, Haizhou Li

The current paper presents the development and validation of SelfScore, a novel benchmark designed to assess the performance of automated Large Language Model (LLM) agents on help desk and professional consultation tasks. Given the…

Computers and Society · Computer Science 2024-10-23 John Mavi, Nathan Summers, Sergio Coronado