Related papers: FlowEval: A Consensus-Based Dialogue Evaluation Fr…

DynaEval: Unifying Turn and Dialogue Level Evaluation

A dialogue is essentially a multi-turn interaction among interlocutors. Effective evaluation metrics should reflect the dynamics of such interaction. Existing automatic metrics are focused very much on the turn-level quality, while ignoring…

Computation and Language · Computer Science 2021-06-08 Chen Zhang, Yiming Chen, Luis Fernando D'Haro, Yan Zhang, Thomas Friedrichs, Grandee Lee, Haizhou Li

PairEval: Open-domain Dialogue Evaluation with Pairwise Comparison

Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to…

Computation and Language · Computer Science 2024-07-19 ChaeHun Park, Minseok Choi, Dohyun Lee, Jaegul Choo

Dialogue Coherence Assessment Without Explicit Dialogue Act Labels

Recent dialogue coherence models use the coherence features designed for monologue texts, e.g. nominal entities, to represent utterances and then explicitly augment them with dialogue-relevant features, e.g., dialogue act labels. It…

Computation and Language · Computer Science 2020-06-04 Mohsen Mesgar, Sebastian Bücker, Iryna Gurevych

FlowEval: Reference-based Evaluation of Generated User Interfaces

While large language models (LLMs) and coding agents are often applied to user interface (UI) development, developers find it difficult to reliably assess their proficiency in visual and interaction design. Existing evaluations either rely…

Multiagent Systems · Computer Science 2026-05-07 Jason Wu, Priyan Vaithilingam, Eldon Schoop, Jeffrey Nichols, Titus Barik

Conversations Are Not Flat: Modeling the Dynamic Information Flow across Dialogue Utterances

Nowadays, open-domain dialogue models can generate acceptable responses according to the historical context based on the large-scale pre-trained language models. However, they generally concatenate the dialogue history directly as the model…

Computation and Language · Computer Science 2021-06-07 Zekang Li, Jinchao Zhang, Zhengcong Fei, Yang Feng, Jie Zhou

FineD-Eval: Fine-grained Automatic Dialogue-Level Evaluation

Recent model-based reference-free metrics for open-domain dialogue evaluation exhibit promising correlations with human judgment. However, they either perform turn-level evaluation or look at a single dialogue quality dimension. One would…

Computation and Language · Computer Science 2022-11-01 Chen Zhang, Luis Fernando D'Haro, Qiquan Zhang, Thomas Friedrichs, Haizhou Li

FlowDelta: Modeling Flow Information Gain in Reasoning for Conversational Machine Comprehension

Conversational machine comprehension requires deep understanding of the dialogue flow, and the prior work proposed FlowQA to implicitly model the context representations in reasoning for better understanding. This paper proposes to…

Computation and Language · Computer Science 2020-01-20 Yi-Ting Yeh, Yun-Nung Chen

Learning an Unreferenced Metric for Online Dialogue Evaluation

Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them…

Computation and Language · Computer Science 2020-05-05 Koustuv Sinha, Prasanna Parthasarathi, Jasmine Wang, Ryan Lowe, William L. Hamilton, Joelle Pineau

SelF-Eval: Self-supervised Fine-grained Dialogue Evaluation

This paper introduces a novel Self-supervised Fine-grained Dialogue Evaluation framework (SelF-Eval). The core idea is to model the correlation between turn quality and the entire dialogue quality. We first propose a novel automatic data…

Computation and Language · Computer Science 2022-09-19 Longxuan Ma, Ziyu Zhuang, Weinan Zhang, Mingda Li, Ting Liu

Open-Domain Dialogue Quality Evaluation: Deriving Nugget-level Scores from Turn-level Scores

Existing dialogue quality evaluation systems can return a score for a given system turn from a particular viewpoint, e.g., engagingness. However, to improve dialogue systems by locating exactly where in a system turn potential problems lie,…

Computation and Language · Computer Science 2023-10-03 Rikiya Takehi, Akihisa Watanabe, Tetsuya Sakai

SLUE: New Benchmark Tasks for Spoken Language Understanding Evaluation on Natural Speech

Progress in speech processing has been facilitated by shared datasets and benchmarks. Historically these have focused on automatic speech recognition (ASR), speaker identification, or other lower-level tasks. Interest has been growing in…

Computation and Language · Computer Science 2022-08-01 Suwon Shon, Ankita Pasad, Felix Wu, Pablo Brusco, Yoav Artzi, Karen Livescu, Kyu J. Han

Measuring Conversational Fluidity in Automated Dialogue Agents

We present an automated evaluation method to measure fluidity in conversational dialogue systems. The method combines various state of the art Natural Language tools into a classifier, and human ratings on these dialogues to train an…

Computation and Language · Computer Science 2019-10-28 Keith Vella, Massimo Poesio, Michael Sigamani, Cihan Dogan, Aimore Dutra, Dimitrios Dimakopoulos, Alfredo Gemma, Ella Walters

DiQAD: A Benchmark Dataset for End-to-End Open-domain Dialogue Assessment

Dialogue assessment plays a critical role in the development of open-domain dialogue systems. Existing work is incapable of providing an end-to-end and human-epistemic assessment dataset, and provides only sub-metrics like coherence…

Computation and Language · Computer Science 2023-10-26 Yukun Zhao, Lingyong Yan, Weiwei Sun, Chong Meng, Shuaiqiang Wang, Zhicong Cheng, Zhaochun Ren, Dawei Yin

DFEE: Interactive DataFlow Execution and Evaluation Kit

DataFlow has been emerging as a new paradigm for building task-oriented chatbots due to its expressive semantic representations of the dialogue tasks. Despite the availability of a large dataset SMCalFlow and a simplified syntax, the…

Computation and Language · Computer Science 2022-12-19 Han He, Song Feng, Daniele Bonadiman, Yi Zhang, Saab Mansour

ACCENT: An Automatic Event Commonsense Evaluation Metric for Open-Domain Dialogue Systems

Commonsense reasoning is omnipresent in human communications and thus is an important feature for open-domain dialogue systems. However, evaluating commonsense in dialogue systems is still an open challenge. We take the first step by…

Computation and Language · Computer Science 2023-11-06 Sarik Ghazarian, Yijia Shao, Rujun Han, Aram Galstyan, Nanyun Peng

User Response and Sentiment Prediction for Automatic Dialogue Evaluation

Automatic evaluation is beneficial for open-domain dialog system development. However, standard word-overlap metrics (BLEU, ROUGE) do not correlate well with human judgements of open-domain dialog systems. In this work we propose to use the…

Computation and Language · Computer Science 2022-02-18 Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu, Dilek Hakkani-Tur

Improving Dialogue Act Classification for Spontaneous Arabic Speech and Instant Messages at Utterance Level

The ability to model and automatically detect dialogue act is an important step toward understanding spontaneous speech and Instant Messages. However, it has been difficult to infer a dialogue act from a surface utterance because it highly…

Computation and Language · Computer Science 2018-06-05 AbdelRahim Elmadany, Sherif Abdou, Mervat Gheith

What is wrong with you?: Leveraging User Sentiment for Automatic Dialog Evaluation

Accurate automatic evaluation metrics for open-domain dialogs are in high demand. Existing model-based metrics for system response evaluation are trained on human annotated data, which is cumbersome to collect. In this work, we propose to…

Computation and Language · Computer Science 2022-03-29 Sarik Ghazarian, Behnam Hedayatnia, Alexandros Papangelis, Yang Liu, Dilek Hakkani-Tur

Tools as Continuous Flow for Evolving Agentic Reasoning

Large Language Models (LLMs) have demonstrated remarkable capabilities in orchestrating tools for reasoning tasks. However, existing methods rely on a step-wise paradigm that lacks a global perspective, which causes error accumulation over…

Artificial Intelligence · Computer Science 2026-05-11 Tairan Huang, Siyu Shang, Qiang Chen, Xiu Su, Yi Chen

How to Evaluate Your Dialogue Models: A Review of Approaches

Evaluating the quality of a dialogue system is an understudied problem. The recent evolution of evaluation method motivated this survey, in which an explicit and comprehensive analysis of the existing methods is sought. We are first to…

Computation and Language · Computer Science 2021-08-04 Xinmeng Li, Wansen Wu, Long Qin, Quanjun Yin