Related papers: Estimating Item Difficulty Using Large Language Mo…

Estimating Item Difficulty with Large Language Models as Experts

Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow,…

Methodology · Statistics 2026-05-19 Diana Kolesnikova, Kirill Fedyanin, Abe D. Hofman, Matthieu J. S. Brinkhuis, Maria Bolsinova

Using Vision + Language Models to Predict Item Difficulty

This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the…

Artificial Intelligence · Computer Science 2026-03-06 Samin Khan

Text-Based Approaches to Item Difficulty Modeling in Large-Scale Assessments: A Systematic Review

Item difficulty plays a crucial role in test performance, interpretability of scores, and equity for all test-takers, especially in large-scale assessments. Traditional approaches to item difficulty modeling rely on field testing and…

Computation and Language · Computer Science 2025-09-30 Sydney Peters, Nan Zhang, Hong Jiao, Ming Li, Tianyi Zhou, Robert Lissitz

Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations

Standardized math assessments require expensive human pilot studies to establish the difficulty of test items. We investigate the predictive value of open-source large language models (LLMs) for evaluating the difficulty of multiple-choice…

Computation and Language · Computer Science 2026-04-22 Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger

Synthetic Student Responses: LLM-Extracted Features for IRT Difficulty Parameter Estimation

Educational assessment relies heavily on knowing question difficulty, traditionally determined through resource-intensive pre-testing with students. This creates significant barriers for both classroom teachers and assessment developers. We…

Computers and Society · Computer Science 2026-02-03 Matias Hoyl

Scaling Item-to-Standard Alignment with Large Language Models: Accuracy, Limits, and Solutions

As educational systems evolve, ensuring that assessment items remain aligned with content standards is essential for maintaining fairness and instructional relevance. Traditional human alignment reviews are accurate but slow and…

Artificial Intelligence · Computer Science 2025-11-26 Farzan Karimi-Malekabadi, Pooya Razavi, Sonya Powers

Exploring the Potential of Large Language Models for Estimating the Reading Comprehension Question Difficulty

Reading comprehension is key to individual success, yet the assessment of question difficulty remains challenging due to the extensive human annotation and large-scale testing required by traditional methods such as linguistic analysis…

Computation and Language · Computer Science 2025-02-26 Yoshee Jain, John Hollander, Amber He, Sunny Tang, Liang Zhang, John Sabatini

Scalable Text-Embedding-informed Cognitive Diagnosis of Large Language Models

Large language models (LLMs) have achieved remarkable performance on diverse benchmarks, yet existing evaluation practices largely rely on coarse summary metrics that obscure underlying reasoning abilities. In this work, we propose novel…

Methodology · Statistics 2026-03-17 Jia Liu, Zhiyu Xu, Yuqi Gu

A Prompt-Engineered Large Language Model, Deep Learning Workflow for Materials Classification

Large language models (LLMs) have demonstrated rapid progress across a wide array of domains. Owing to the very large number of parameters and training data in LLMs, these models inherently encompass an expansive and comprehensive materials…

Materials Science · Physics 2024-11-20 Siyu Liu, Tongqi Wen, A. S. L. Subrahmanyam Pattamatta, David J. Srolovitz

Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before it is administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity…

Computation and Language · Computer Science 2026-05-20 Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee

Prediction of Item Difficulty for Reading Comprehension Items by Creation of Annotated Item Repository

Prediction of item difficulty based on its text content is of substantial interest. In this paper, we focus on the related problem of recovering IRT-based difficulty when the data originally reported item p-value (percent correct…

Computation and Language · Computer Science 2026-04-01 Radhika Kapoor, Sang T. Truong, Nick Haber, Maria Araceli Ruiz-Primo, Benjamin W. Domingue

Estimating Exam Item Difficulty with LLMs: A Benchmark on Brazil's ENEM Corpus

As Large Language Models (LLMs) are increasingly deployed to generate educational content, a critical safety question arises: can these models reliably estimate the difficulty of the questions they produce? Using Brazil's high-stakes ENEM…

Computers and Society · Computer Science 2026-02-09 Thiago Brant, Julien Kühn, Jun Pang

HLLM: Enhancing Sequential Recommendations via Hierarchical Large Language Models for Item and User Modeling

Large Language Models (LLMs) have achieved remarkable success in various fields, prompting several studies to explore their potential in recommendation systems. However, these attempts have so far resulted in only modest improvements over…

Information Retrieval · Computer Science 2024-09-20 Junyi Chen, Lu Chi, Bingyue Peng, Zehuan Yuan

Enhancing Item Tokenization for Generative Recommendation through Self-Improvement

Generative recommendation systems, driven by large language models (LLMs), present an innovative approach to predicting user preferences by modeling items as token sequences and generating recommendations in a generative manner. A critical…

Machine Learning · Computer Science 2024-12-24 Runjin Chen, Mingxuan Ju, Ngoc Bui, Dimosthenis Antypas, Stanley Cai, Xiaopeng Wu, Leonardo Neves, Zhangyang Wang, Neil Shah, Tong Zhao

Estimating problem difficulty without ground truth using Large Language Model comparisons

Recent advances in the finetuning of large language models (LLMs) have significantly improved their performance on established benchmarks, emphasizing the need for increasingly difficult, synthetic data. A key step in this data generation…

Machine Learning · Computer Science 2025-12-17 Marthe Ballon, Andres Algaba, Brecht Verbeken, Vincent Ginis

Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests

Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and…

Computation and Language · Computer Science 2025-10-14 Yanbin Fu, Hong Jiao, Tianyi Zhou, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters, Robert W. Lissitz

Leveraging Large Language Models for Predicting Cost and Duration in Software Engineering Projects

Accurate estimation of project costs and durations remains a pivotal challenge in software engineering, directly impacting budgeting and resource management. Traditional estimation techniques, although widely utilized, often fall short due…

Software Engineering · Computer Science 2024-09-17 Justin Carpenter, Chia-Ying Wu, Nasir U. Eisty

Reconstructing Item Characteristic Curves using Fine-Tuned Large Language Models

Traditional methods for determining assessment item parameters, such as difficulty and discrimination, rely heavily on expensive field testing to collect student performance data for Item Response Theory (IRT) calibration. This study…

Computation and Language · Computer Science 2026-01-07 Christopher Ormerod

Revisiting Generalization Across Difficulty Levels: It's Not So Easy

We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data…

Computation and Language · Computer Science 2025-11-27 Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach

Evaluating Large Language Models for Material Selection

Material selection is a crucial step in conceptual design due to its significant impact on the functionality, aesthetics, manufacturability, and sustainability impact of the final product. This study investigates the use of Large Language…

Computation and Language · Computer Science 2024-05-08 Daniele Grandi, Yash Patawari Jain, Allin Groom, Brandon Cramer, Christopher McComb