Human-Aligned Code Readability Assessment with Large Language Models

Wendkûuni C. Ouédraogo; Yinghua Li; Xueqi Dang; Pawel Borsukiewicz; Xin Zhou; Anil Koyuncu; Jacques Klein; David Lo; Tegawendé F. Bissyandé

Software Engineering 2025-10-21 v1

Abstract

Code readability is crucial for software comprehension and maintenance, yet difficult to assess at scale. Traditional static metrics often fail to capture the subjective, context-sensitive nature of human judgments. Large Language Models (LLMs) offer a scalable alternative, but their behavior as readability evaluators remains underexplored. We introduce CoReEval, the first large-scale benchmark for evaluating LLM-based code readability assessment, comprising over 1.4 million model-snippet-prompt evaluations across 10 state-of-the-art LLMs. The benchmark spans 3 programming languages (Java, Python, CUDA), 2 code types (functional code and unit tests), 4 prompting strategies (ZSL, FSL, CoT, ToT), 9 decoding settings, and developer-guided prompts tailored to junior and senior personas. We compare LLM outputs against human annotations and a validated static model, analyzing numerical alignment (MAE, Pearson's and Spearman's correlations) and justification quality (sentiment, aspect coverage, semantic clustering). Our findings show that developer-guided prompting grounded in human-defined readability dimensions improves alignment in structured contexts, enhances explanation quality, and enables lightweight personalization through persona framing. However, increased score variability highlights trade-offs between alignment, stability, and interpretability. CoReEval provides a robust foundation for prompt engineering, model alignment studies, and human-in-the-loop evaluation, with applications in education, onboarding, and CI/CD pipelines where LLMs can serve as explainable, adaptable reviewers.
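The numerical alignment measures named in the abstract (MAE, Pearson's and Spearman's correlations) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the 1-5 score scale, and the sample scores below are all hypothetical, chosen only to show how an LLM's readability scores might be compared against human annotations.

```python
from math import sqrt


def mae(pred, ref):
    """Mean absolute error between two equal-length score lists."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical readability scores for five snippets (illustrative only).
human_scores = [4.0, 2.0, 5.0, 3.0, 1.0]
llm_scores = [3.5, 2.5, 4.5, 3.5, 2.0]

if __name__ == "__main__":
    print("MAE:     ", mae(llm_scores, human_scores))
    print("Pearson: ", pearson(llm_scores, human_scores))
    print("Spearman:", spearman(llm_scores, human_scores))
```

MAE captures how far the scores are from the human ratings in absolute terms, while the two correlations capture whether the model orders snippets the way humans do; a model can do well on one and poorly on the other, which is presumably why the abstract reports all three.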

Keywords

large language model evaluation code generation benchmark evaluation

Cite

@article{arxiv.2510.16579,
  title  = {Human-Aligned Code Readability Assessment with Large Language Models},
  author = {Wendkûuni C. Ouédraogo and Yinghua Li and Xueqi Dang and Pawel Borsukiewicz and Xin Zhou and Anil Koyuncu and Jacques Klein and David Lo and Tegawendé F. Bissyandé},
  journal= {arXiv preprint arXiv:2510.16579},
  year   = {2025}
}
