Related papers: Evaluating Open-QA Evaluation

Accurate and Nuanced Open-QA Evaluation Through Textual Entailment

Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for the ambiguity in questions and the lack of semantic understanding in evaluators.…

Computation and Language · Computer Science 2024-05-28 Peiran Yao, Denilson Barbosa

Improving Automatic VQA Evaluation Using Large Language Models

8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a…

Computer Vision and Pattern Recognition · Computer Science 2024-01-11 Oscar Mañas, Benno Krojer, Aishwarya Agrawal

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering

To evaluate Large Language Models (LLMs) for question answering (QA), traditional methods typically focus on assessing single-turn responses to given questions. However, this approach doesn't capture the dynamic nature of human-AI…

Computation and Language · Computer Science 2024-11-19 Ruosen Li, Ruochen Li, Barry Wang, Xinya Du

KNVQA: A Benchmark for evaluation knowledge-based VQA

Within the multimodal field, large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. However, LVLMs are still plagued by the two…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Sirui Cheng, Siyu Zhang, Jiayi Wu, Muchen Lan

An Empirical Study of Evaluating Long-form Question Answering

Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram…

Information Retrieval · Computer Science 2025-04-28 Ning Xian, Yixing Fan, Ruqing Zhang, Maarten de Rijke, Jiafeng Guo

Evaluation of Question Answering Systems: Complexity of judging a natural language

Question answering (QA) systems are among the most important and rapidly developing research topics in natural language processing (NLP). One reason is that a QA system allows humans to interact more naturally with a machine,…

Computation and Language · Computer Science 2022-09-27 Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib

A Benchmark for Long-Form Medical Question Answering

There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable,…

Computation and Language · Computer Science 2024-11-21 Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour

CFMatch: Aligning Automated Answer Equivalence Evaluation with Expert Judgments For Open-Domain Question Answering

Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics to determine answer equivalence (AE) often do not align with…

Computation and Language · Computer Science 2024-07-02 Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber

CUS-QA: Local-Knowledge-Oriented Open-Ended Question Answering Dataset

We introduce CUS-QA, a benchmark for evaluation of open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset…

Computation and Language · Computer Science 2026-02-03 Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses

Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended questions are highly diverse and difficult to quantify, and cannot be simply evaluated as correct or…

Computation and Language · Computer Science 2024-10-03 Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1,…

Computation and Language · Computer Science 2025-11-12 Sher Badshah, Hassan Sajjad

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

One of the most widely used tasks for evaluating Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to…

Computation and Language · Computer Science 2025-06-10 Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, Roberto Navigli

MinosEval: Distinguishing Factoid and Non-Factoid for Tailored Open-Ended QA Evaluation with LLMs

Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions,…

Computation and Language · Computer Science 2025-06-19 Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu, Qi Ye, Tong Ruan

Evaluating Open-Domain Question Answering in the Era of Large Language Models

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is…

Computation and Language · Computer Science 2023-07-10 Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current…

Computation and Language · Computer Science 2024-10-15 Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

Suvach -- Generated Hindi QA benchmark

Current evaluation benchmarks for question answering (QA) in Indic languages often rely on machine translation of existing English datasets. This approach suffers from bias and inaccuracies inherent in machine translation, leading to…

Computation and Language · Computer Science 2024-05-01 Vaishak Narayanan, Prabin Raj KP, Saifudheen Nouphal

EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta

Despite the remarkable coherence of Large Language Models (LLMs), existing evaluation methods often suffer from fluency bias and rely heavily on multiple-choice formats, making it difficult to assess factual accuracy and complex reasoning…

Computation and Language · Computer Science 2025-01-03 Raymond Bernard, Shaina Raza, Subhabrata Das, Rahul Murugan

EfficientEQA: An Efficient Approach to Open-Vocabulary Embodied Question Answering

Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without…

Robotics · Computer Science 2025-08-12 Kai Cheng, Zhengyuan Li, Xingpeng Sun, Byung-Cheol Min, Amrit Singh Bedi, Aniket Bera

LFQA-E: Carefully Benchmarking Long-form QA Evaluation

Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format.…

Computation and Language · Computer Science 2026-02-03 Yuchen Fan, Chen Lin, Xin Zhong, Shuo Zhang, Heng Zhou, Yuchen Zhang, Mingyu Liang, Chengxing Xie, Ermo Hua, Gang Chen, Zhizhou He, Cheng Huang, Ning Ding, Bowen Zhou

Factuality of Large Language Models: A Survey

Large language models (LLMs), especially when instruction-tuned for chat, have become part of our daily lives, freeing people from the process of searching, extracting, and integrating information from multiple sources by offering a…

Computation and Language · Computer Science 2024-11-01 Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Georgiev, Rocktim Jyoti Das, Preslav Nakov