Evaluating Open-QA Evaluation

Cunxiang Wang; Sirui Cheng; Qipeng Guo; Yuanhao Yue; Bowen Ding; Zhikun Xu; Yidong Wang; Xiangkun Hu; Zheng Zhang; Yue Zhang

Computation and Language; Artificial Intelligence. 2023-10-24 (v4)

Abstract

This study focuses on evaluating the Open Question Answering (Open-QA) task, which can directly estimate the factuality of large language models (LLMs). Current automatic evaluation methods have shown limitations, so human evaluation remains the most reliable approach. We introduce a new task, Evaluating QA Evaluation (QA-Eval), and the corresponding dataset EVOUNA, designed to assess the accuracy of AI-generated answers in relation to standard answers within Open-QA. We measure the performance of automatic evaluation methods against human-annotated results, treating methods that correlate highly with human evaluations as more reliable. We also discuss the pitfalls of current methods and ways to improve LLM-based evaluators. We believe the QA-Eval task and the EVOUNA dataset will facilitate the development of more effective automatic evaluation tools and prove valuable for future research in this area. All resources are available at https://github.com/wangcunxiang/QA-Eval under the Apache-2.0 License.
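
To make the QA-Eval setup concrete, the following is a minimal sketch of how an automatic evaluator might be scored against human annotations. It is not the authors' implementation. The substring-based lexical-match evaluator, the choice of accuracy and F1 as agreement metrics, the toy records, and all function names (`normalize`, `lexical_match`, `agreement`) are assumptions made for illustration only.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def lexical_match(generated: str, gold_answers: list[str]) -> bool:
    """Assumed evaluator: correct if any normalized gold answer is a substring."""
    gen = normalize(generated)
    return any(normalize(g) in gen for g in gold_answers if normalize(g))


def agreement(predicted: list[bool], human: list[bool]) -> dict[str, float]:
    """Accuracy and F1 of evaluator judgments, treating human labels as truth."""
    tp = sum(p and h for p, h in zip(predicted, human))
    fp = sum(p and not h for p, h in zip(predicted, human))
    fn = sum(h and not p for p, h in zip(predicted, human))
    correct = sum(p == h for p, h in zip(predicted, human))
    denom = 2 * tp + fp + fn
    return {
        "accuracy": correct / len(human),
        "f1": 2 * tp / denom if denom else 0.0,
    }


# Hypothetical toy records: (generated answer, gold answers, human judgment).
examples = [
    ("The capital of France is Paris.", ["Paris"], True),
    ("It was written by Jane Austen in 1813.", ["Jane Austen"], True),
    ("I believe it is not Paris but Lyon.", ["Paris"], False),
    ("The author is Austen.", ["Jane Austen"], True),
]

predicted = [lexical_match(answer, gold) for answer, gold, _ in examples]
human = [label for _, _, label in examples]
scores = agreement(predicted, human)
print(predicted, scores)
```

In this toy data the evaluator disagrees with the human label twice: it accepts the third answer because the gold string appears in it, and it rejects the fourth because the answer gives only a surname. Both are cases where a surface-level evaluator and a human judge can diverge.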

Keywords

question answering; large language model evaluation; large language model

Cite

@article{arxiv.2305.12421,
  title  = {Evaluating Open-QA Evaluation},
  author = {Cunxiang Wang and Sirui Cheng and Qipeng Guo and Yuanhao Yue and Bowen Ding and Zhikun Xu and Yidong Wang and Xiangkun Hu and Zheng Zhang and Yue Zhang},
  journal= {arXiv preprint arXiv:2305.12421},
  year   = {2023}
}

Comments

Accepted to the NeurIPS 2023 Datasets and Benchmarks track; 28 pages
