Related papers: An Automatic Question Usability Evaluation Toolkit

CRACQ: A Multi-Dimensional Approach To Automated Document Assessment

This paper presents CRACQ, a multi-dimensional evaluation framework tailored to evaluate documents across five specific traits: Coherence, Rigor, Appropriateness, Completeness, and Quality. Building on insights from trait-based Automated…

Computation and Language · Computer Science 2025-10-06 Ishak Soltani, Francisco Belo, Bernardo Tavares

AGenT Zero: Zero-shot Automatic Multiple-Choice Question Generation for Skill Assessments

Multiple-choice questions (MCQs) offer the most promising avenue for skill evaluation in the era of virtual education and job recruiting, where traditional performance-based alternatives such as projects and essays have become less viable,…

Computers and Society · Computer Science 2020-12-22 Eric Li, Jingyi Su, Hao Sheng, Lawrence Wai

An Algorithm for Generating Gap-Fill Multiple Choice Questions of an Expert System

This research is aimed to propose an artificial intelligence algorithm comprising an ontology-based design, text mining, and natural language processing for automatically generating gap-fill multiple choice questions (MCQs). The simulation…

Artificial Intelligence · Computer Science 2021-09-24 Pornpat Sirithumgul, Pimpaka Prasertsilp, Lorne Olfman

AI-Assisted Model for Generating Multiple-Choice Questions

Multiple-choice questions (MCQs) are widely used across diverse educational fields and levels. Well-designed MCQs should evaluate knowledge application in real-world situations. However, writing such test items in sufficient numbers is…

Human-Computer Interaction · Computer Science 2026-02-10 Tetiana Krushynska, Jani Ursin, Ville Heilala

Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models

As scientific knowledge grows at an unprecedented pace, evaluation benchmarks must evolve to reflect new discoveries and ensure language models are tested on current, diverse literature. We propose a scalable, modular framework for…

Computation and Language · Computer Science 2025-09-16 Ozan Gokdemir, Neil Getty, Robert Underwood, Sandeep Madireddy, Franck Cappello, Arvind Ramanathan, Ian T. Foster, Rick L. Stevens

SHEET: A Multi-purpose Open-source Speech Human Evaluation Estimation Toolkit

We introduce SHEET, a multi-purpose open-source toolkit designed to accelerate subjective speech quality assessment (SSQA) research. SHEET stands for the Speech Human Evaluation Estimation Toolkit, which focuses on data-driven deep neural…

Sound · Computer Science 2025-05-22 Wen-Chin Huang, Erica Cooper, Tomoki Toda

SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence

We introduce SciEvalKit, a unified benchmarking toolkit designed to evaluate AI models for science across a broad range of scientific disciplines and task capabilities. Unlike general-purpose evaluation platforms, SciEvalKit focuses on the…

Artificial Intelligence · Computer Science 2026-01-07 Yiheng Wang, Yixin Chen, Shuo Li, Yifan Zhou, Bo Liu, Hengjian Gao, Jiakang Yuan, Jia Bu, Wanghan Xu, Yuhao Zhou, Xiangyu Zhao, Zhiwang Zhou, Fengxiang Wang, Haodong Duan, Songyang Zhang, Jun Yao, Han Deng, Yizhou Wang, Jiabei Xiao, Jiaqi Liu, Encheng Su, Yujie Liu, Weida Wang, Junchi Yao, Shenghe Zheng, Haoran Sun, Runmin Ma, Xiangchao Yan, Bo Zhang, Dongzhan Zhou, Shufei Zhang, Peng Ye, Xiaosong Wang, Shixiang Tang, Wenlong Zhang, Lei Bai

Rethinking Generative Large Language Model Evaluation for Semantic Comprehension

Despite their sophisticated capabilities, large language models (LLMs) encounter a major hurdle in effective assessment. This paper first revisits the prevalent evaluation method, multiple choice question answering (MCQA), which allows for…

Computation and Language · Computer Science 2024-03-13 Fangyun Wei, Xi Chen, Lin Luo

Orchestrating LLM Agents for Scientific Research: A Pilot Study of Multiple Choice Question (MCQ) Generation and Evaluation

Advances in large language models (LLMs) are rapidly transforming scientific work, yet empirical evidence on how these systems reshape research activities remains limited. We report a mixed-methods pilot evaluation of an AI-orchestrated…

Computers and Society · Computer Science 2026-02-24 Yuan An

Generating AI Literacy MCQs: A Multi-Agent LLM Approach

Artificial intelligence (AI) is transforming society, making it crucial to prepare the next generation through AI literacy in K-12 education. However, scalable and reliable AI literacy materials and assessment resources are lacking. To…

Human-Computer Interaction · Computer Science 2024-12-03 Jiayi Wang, Ruiwei Xiao, Ying-Jui Tseng

Assessing the Quality of Multiple-Choice Questions Using GPT-4 and Rule-Based Methods

Multiple-choice questions with item-writing flaws can negatively impact student learning and skew analytics. These flaws are often present in student-generated questions, making it difficult to assess their quality and suitability for…

Computation and Language · Computer Science 2023-07-18 Steven Moore, Huy A. Nguyen, Tianying Chen, John Stamper

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

One of the most widely used tasks for evaluating Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to…

Computation and Language · Computer Science 2025-06-10 Francesco Maria Molfese, Luca Moroni, Luca Gioffré, Alessandro Scirè, Simone Conia, Roberto Navigli

The Impact of Item-Writing Flaws on Difficulty and Discrimination in Item Response Theory

High-quality test items are essential for educational assessments, particularly within Item Response Theory (IRT). Traditional validation methods rely on resource-intensive pilot testing to estimate item difficulty and discrimination. More…

Computation and Language · Computer Science 2025-08-08 Robin Schmucker, Steven Moore

DeepQR: Neural-based Quality Ratings for Learnersourced Multiple-Choice Questions

Automated question quality rating (AQQR) aims to evaluate question quality through computational means, thereby addressing emerging challenges in online learnersourced question repositories. Existing methods for AQQR rely solely on…

Computation and Language · Computer Science 2021-11-22 Lin Ni, Qiming Bao, Xiaoxuan Li, Qianqian Qi, Paul Denny, Jim Warren, Michael Witbrock, Jiamou Liu

MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback

Automatic question generation (QG) is essential for AI and NLP, particularly in intelligent tutoring, dialogue systems, and fact verification. Generating multiple-choice questions (MCQG) for professional exams, like the United States…

Computation and Language · Computer Science 2025-02-11 Zonghai Yao, Aditya Parashar, Huixue Zhou, Won Seok Jang, Feiyun Ouyang, Zhichao Yang, Hong Yu

Math Multiple Choice Question Generation via Human-Large Language Model Collaboration

Multiple choice questions (MCQs) are a popular method for evaluating students' knowledge due to their efficiency in administration and grading. Crafting high-quality math MCQs is a labor-intensive process that requires educators to…

Computation and Language · Computer Science 2024-05-03 Jaewook Lee, Digory Smith, Simon Woodhead, Andrew Lan

SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References

Evaluation of QA systems is very challenging and expensive, with the most reliable approach being human annotations of correctness of answers for questions. Recent works (AVA, BEM) have shown that transformer LM encoder based similarity…

Computation and Language · Computer Science 2023-09-22 Matteo Gabburo, Siddhant Garg, Rik Koncel Kedziorski, Alessandro Moschitti

BenchMarker: An Education-Inspired Toolkit for Highlighting Flaws in Multiple-Choice Benchmarks

Multiple-choice question answering (MCQA) is standard in NLP, but benchmarks lack rigorous quality control. We present BenchMarker, an education-inspired toolkit using LLM judges to flag three common MCQ flaws: 1) contamination: items…

Computation and Language · Computer Science 2026-04-21 Nishant Balepur, Bhavya Rajasekaran, Jane Oh, Michael Xie, Atrey Desai, Vipul Gupta, Steven James Moore, Eunsol Choi, Rachel Rudinger, Jordan Lee Boyd-Graber

ParseIT: A Question-Answer based Tool to Learn Parsing Techniques

Parsing (also called syntax analysis) techniques cover a substantial portion of any undergraduate Compiler Design course. We present ParseIT, a tool to help students understand the parsing techniques through question-answering. ParseIT…

Programming Languages · Computer Science 2017-02-03 Amey Karkare, Nimisha Agarwal

Multiple-Choice Question Generation: Towards an Automated Assessment Framework

Automated question generation is an important approach to enable personalisation of English comprehension assessment. Recently, transformer-based pretrained language models have demonstrated the ability to produce appropriate questions from…

Computation and Language · Computer Science 2022-09-27 Vatsal Raina, Mark Gales