Related papers: Designing generalisation evaluation function throu…

Objective Function Designing Led by User Preferences Acquisition

Many real world problems can be defined as optimisation problems in which the aim is to maximise an objective function. The quality of obtained solution is directly linked to the pertinence of the used objective function. However, designing…

Machine Learning · Computer Science 2012-04-24 Patrick Taillandier, Julien Gaffuri

On the Use of Linguistic Features for the Evaluation of Generative Dialogue Systems

Automatically evaluating text-based, non-task-oriented dialogue systems (i.e., 'chatbots') remains an open problem. Previous approaches have suffered challenges ranging from poor correlation with human judgment to poor generalization and…

Computation and Language · Computer Science 2021-04-14 Ian Berlot-Attwell, Frank Rudzicz

How to Evaluate the Next System: Automatic Dialogue Evaluation from the Perspective of Continual Learning

Automatic dialogue evaluation plays a crucial role in open-domain dialogue research. Previous works train neural networks with limited annotation for conducting automatic dialogue evaluation, which would naturally affect the evaluation…

Computation and Language · Computer Science 2019-12-11 Lu Li, Zhongheng He, Xiangyang Zhou, Dianhai Yu

Achieving Reliable Human Assessment of Open-Domain Dialogue Systems

Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in…

Computation and Language · Computer Science 2022-03-14 Tianbo Ji, Yvette Graham, Gareth J. F. Jones, Chenyang Lyu, Qun Liu

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and takes time given the large amounts of data. LLMs are increasingly…

Human-Computer Interaction · Computer Science 2025-08-07 Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Hyo Jin Do, Werner Geyer

Do Large Language Models Perform the Way People Expect? Measuring the Human Generalization Function

What makes large language models (LLMs) impressive is also what makes them hard to evaluate: their diversity of uses. To evaluate these models, we must understand the purposes they will be used for. We consider a setting where these…

Computation and Language · Computer Science 2024-06-04 Keyon Vafa, Ashesh Rambachan, Sendhil Mullainathan

Aligning Generalisation Between Humans and Machines

Recent advances in AI, including generative approaches, have resulted in technology that can support humans in scientific discovery and forming decisions, but may also disrupt democracies and target individuals. The responsible use of…

Artificial Intelligence · Computer Science 2025-05-28 Filip Ilievski, Barbara Hammer, Frank van Harmelen, Benjamin Paassen, Sascha Saralajew, Ute Schmid, Michael Biehl, Marianna Bolognesi, Xin Luna Dong, Kiril Gashteovski, Pascal Hitzler, Giuseppe Marra, Pasquale Minervini, Martin Mundt, Axel-Cyrille Ngonga Ngomo, Alessandro Oltramari, Gabriella Pasi, Zeynep G. Saribatur, Luciano Serafini, John Shawe-Taylor, Vered Shwartz, Gabriella Skitalinskaya, Clemens Stachl, Gido M. van de Ven, Thomas Villmann

Investigating Evaluation of Open-Domain Dialogue Systems With Human Generated Multiple References

The aim of this paper is to mitigate the shortcomings of automatic evaluation of open-domain dialog systems through multi-reference evaluation. Existing metrics have been shown to correlate poorly with human judgement, particularly in…

Computation and Language · Computer Science 2019-09-10 Prakhar Gupta, Shikib Mehri, Tiancheng Zhao, Amy Pavel, Maxine Eskenazi, Jeffrey P. Bigham

How to Evaluate Your Dialogue Models: A Review of Approaches

Evaluating the quality of a dialogue system is an understudied problem. The recent evolution of evaluation method motivated this survey, in which an explicit and comprehensive analysis of the existing methods is sought. We are first to…

Computation and Language · Computer Science 2021-08-04 Xinmeng Li, Wansen Wu, Long Qin, Quanjun Yin

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response…

Computation and Language · Computer Science 2018-01-18 Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau

Machine Generalization and Human Categorization: An Information-Theoretic View

In designing an intelligent system that must be able to explain its reasoning to a human user, or to provide generalizations that the human user finds reasonable, it may be useful to take into consideration psychological data on what types…

Artificial Intelligence · Computer Science 2013-04-15 James E. Corter, Mark A. Gluck

Survey on Evaluation Methods for Dialogue Systems

In this paper we survey the methods and concepts developed for the evaluation of dialogue systems. Evaluation is a crucial part during the development process. Often, dialogue systems are evaluated by means of human evaluations and…

Computation and Language · Computer Science 2020-06-29 Jan Deriu, Alvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, Mark Cieliebak

Human or Machine: Automating Human Likeliness Evaluation of NLG Texts

Automatic evaluation of various text quality criteria produced by data-driven intelligent methods is very common and useful because it is cheap, fast, and usually yields repeatable results. In this paper, we present an attempt to automate…

Computation and Language · Computer Science 2020-06-08 Erion Çano, Ondřej Bojar

Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

We present "AutoJudge", an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e. dialogue systems talking to itself. Then, it uses human ratings on these…

Artificial Intelligence · Computer Science 2020-06-26 Jan Deriu, Mark Cieliebak

Designing Precise and Robust Dialogue Response Evaluators

Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In…

Computation and Language · Computer Science 2020-04-27 Tianyu Zhao, Divesh Lala, Tatsuya Kawahara

Ask Your Humans: Using Human Instructions to Improve Generalization in Reinforcement Learning

Complex, multi-task problems have proven to be difficult to solve efficiently in a sparse-reward reinforcement learning setting. In order to be sample efficient, multi-task learning requires reuse and sharing of low-level policies. To…

Machine Learning · Computer Science 2021-09-28 Valerie Chen, Abhinav Gupta, Kenneth Marino

How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

Though generative dialogue modeling is widely seen as a language modeling task, the task demands an agent to have a complex natural language understanding of its input text to carry a meaningful interaction with a user. The automatic…

Computation and Language · Computer Science 2020-08-25 Prasanna Parthasarathi, Joelle Pineau, Sarath Chandar

Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

A major challenge in the field of Text Generation is evaluation: Human evaluations are cost-intensive, and automated metrics often display considerable disagreement with human judgments. In this paper, we propose a statistical model of Text…

Computation and Language · Computer Science 2023-06-07 Jan Deriu, Pius von Däniken, Don Tuggener, Mark Cieliebak

A Critical Look at Meta-evaluating Summarisation Evaluation Metrics

Effective summarisation evaluation metrics enable researchers and practitioners to compare different summarisation systems efficiently. Estimating the effectiveness of an automatic evaluation metric, termed meta-evaluation, is a critically…

Computation and Language · Computer Science 2024-10-01 Xiang Dai, Sarvnaz Karimi, Biaoyan Fang

Fine-Tuning Language Models Using Formal Methods Feedback

Although pre-trained language models encode generic knowledge beneficial for planning and control, they may fail to generate appropriate control policies for domain-specific tasks. Existing fine-tuning methods use human feedback to address…

Artificial Intelligence · Computer Science 2024-04-02 Yunhao Yang, Neel P. Bhatt, Tyler Ingebrand, William Ward, Steven Carr, Zhangyang Wang, Ufuk Topcu