Related papers: Evaluating Open-QA Evaluation

200 papers

Open-domain question answering (Open-QA) is a common task for evaluating large language models (LLMs). However, current Open-QA evaluations are criticized for the ambiguity in questions and the lack of semantic understanding in evaluators.…

Computation and Language · Computer Science 2024-05-28 Peiran Yao, Denilson Barbosa

8 years after the visual question answering (VQA) task was proposed, accuracy remains the primary metric for automatic evaluation. VQA Accuracy has been effective so far in the IID evaluation setting. However, our community is undergoing a…

Computer Vision and Pattern Recognition · Computer Science 2024-01-11 Oscar Mañas, Benno Krojer, Aishwarya Agrawal

To evaluate Large Language Models (LLMs) for question answering (QA), traditional methods typically focus on assessing single-turn responses to given questions. However, this approach doesn't capture the dynamic nature of human-AI…

Computation and Language · Computer Science 2024-11-19 Ruosen Li, Ruochen Li, Barry Wang, Xinya Du

Within the multimodal field, large vision-language models (LVLMs) have made significant progress due to their strong perception and reasoning capabilities in the visual and language systems. However, LVLMs are still plagued by the two…

Computer Vision and Pattern Recognition · Computer Science 2024-06-14 Sirui Cheng, Siyu Zhang, Jiayi Wu, Muchen Lan

Long-form question answering (LFQA) aims to generate lengthy answers to complex questions. This scenario presents great flexibility as well as significant challenges for evaluation. Most evaluations rely on deterministic metrics that depend on string or n-gram…

Information Retrieval · Computer Science 2025-04-28 Ning Xian, Yixing Fan, Ruqing Zhang, Maarten de Rijke, Jiafeng Guo

Question answering (QA) systems are among the most important and rapidly developing research topics in natural language processing (NLP). One reason is that a QA system allows humans to interact more naturally with a machine,…

Computation and Language · Computer Science 2022-09-27 Amer Farea, Zhen Yang, Kien Duong, Nadeesha Perera, Frank Emmert-Streib

There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA). Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. While valuable,…

Computation and Language · Computer Science 2024-11-21 Pedram Hosseini, Jessica M. Sin, Bing Ren, Bryceton G. Thomas, Elnaz Nouri, Ali Farahanchi, Saeed Hassanpour

Question answering (QA) can only make progress if we know if an answer is correct, but for many of the most challenging and interesting QA examples, current evaluation metrics to determine answer equivalence (AE) often do not align with…

Computation and Language · Computer Science 2024-07-02 Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Boyd-Graber

We introduce CUS-QA, a benchmark for evaluation of open-ended regional question answering that encompasses both textual and visual modalities. We also provide strong baselines using state-of-the-art large language models (LLMs). Our dataset…

Computation and Language · Computer Science 2026-02-03 Jindřich Libovický, Jindřich Helcl, Andrei Manea, Gianluca Vico

Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended questions are highly diverse and difficult to quantify, and cannot be simply evaluated as correct or…

Computation and Language · Computer Science 2024-10-03 Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1,…

Computation and Language · Computer Science 2025-11-12 Sher Badshah, Hassan Sajjad

One of the most widely used tasks for evaluating Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to…

Computation and Language · Computer Science 2025-06-10 Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, Roberto Navigli

Open-ended question answering (QA) is a key task for evaluating the capabilities of large language models (LLMs). Compared to closed-ended QA, it demands longer answer statements, more nuanced reasoning processes, and diverse expressions,…

Computation and Language · Computer Science 2025-06-19 Yongqi Fan, Yating Wang, Guandong Wang, Jie Zhai, Jingping Liu, Qi Ye, Tong Ruan

Lexical matching remains the de facto evaluation method for open-domain question answering (QA). Unfortunately, lexical matching fails completely when a plausible candidate answer does not appear in the list of gold answers, which is…

Computation and Language · Computer Science 2023-07-10 Ehsan Kamalloo, Nouha Dziri, Charles L. A. Clarke, Davood Rafiei

Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current…

Computation and Language · Computer Science 2024-10-15 Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, Jordan Lee Boyd-Graber

Current evaluation benchmarks for question answering (QA) in Indic languages often rely on machine translation of existing English datasets. This approach suffers from bias and inaccuracies inherent in machine translation, leading to…

Computation and Language · Computer Science 2024-05-01 Vaishak Narayanan, Prabin Raj KP, Saifudheen Nouphal

Despite the remarkable coherence of Large Language Models (LLMs), existing evaluation methods often suffer from fluency bias and rely heavily on multiple-choice formats, making it difficult to assess factual accuracy and complex reasoning…

Computation and Language · Computer Science 2025-01-03 Raymond Bernard, Shaina Raza, Subhabrata Das, Rahul Murugan

Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without…

Robotics · Computer Science 2025-08-12 Kai Cheng, Zhengyuan Li, Xingpeng Sun, Byung-Cheol Min, Amrit Singh Bedi, Aniket Bera

Long-Form Question Answering (LFQA) involves generating comprehensive, paragraph-level responses to open-ended questions, which poses a significant challenge for evaluation due to the richness of information and flexible response format.…

Computation and Language · Computer Science 2026-02-03 Yuchen Fan, Chen Lin, Xin Zhong, Shuo Zhang, Heng Zhou, Yuchen Zhang, Mingyu Liang, Chengxing Xie, Ermo Hua, Gang Chen, Zhizhou He, Cheng Huang, Ning Ding, Bowen Zhou

Large language models (LLMs), especially when instruction-tuned for chat, have become part of our daily lives, freeing people from the process of searching, extracting, and integrating information from multiple sources by offering a…

Computation and Language · Computer Science 2024-11-01 Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, Fei Liu, Georgi Georgiev, Rocktim Jyoti Das, Preslav Nakov