Related papers: Comparing Developer and LLM Biases in Code Evaluat…

Disagreement as Data: Reasoning Trace Analytics in Multi-Agent Systems

Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged…

Computation and Language · Computer Science 2026-01-21 Elham Tajik, Conrad Borchers, Bahar Shahrokhian, Sebastian Simon, Ali Keramati, Sonika Pal, Sreecharan Sankaranarayanan

The Realignment Problem: When Right becomes Wrong in LLMs

Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make…

Computation and Language · Computer Science 2026-05-12 Aakash Sen Sharma, Debdeep Sanyal, Manodeep Ray, Vivek Srivastava, Shirish Karande, Murari Mandal

Measuring LLM Trust Allocation Across Conflicting Software Artifacts

LLM-based software engineering assistants fail not only by producing incorrect outputs, but also by allocating trust to the wrong artifact when code, documentation, and tests disagree. Existing evaluations focus mainly on downstream…

Software Engineering · Computer Science 2026-04-07 Noshin Ulfat, Ahsanul Ameen Sabit, Soneya Binta Hossain

TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. However, the continual learning aspect of these aligned LLMs has been largely overlooked. Existing…

Computation and Language · Computer Science 2023-10-11 Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xuanjing Huang

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself…

Artificial Intelligence · Computer Science 2026-05-29 Yundong Kim, Heyoung Yang

Revisiting Human-vs-LLM judgments using the TREC Podcast Track

Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high user agreement with ground truth (human)…

Information Retrieval · Computer Science 2026-01-15 Watheq Mansour, J. Shane Culpepper, Joel Mackenzie, Andrew Yates

Hearing Between the Lines: Unlocking the Reasoning Power of LLMs for Speech Evaluation

Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models…

Computation and Language · Computer Science 2026-01-27 Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama

Through the Judge's Eyes: Inferred Thinking Traces Improve Reliability of LLM Raters

Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces,…

Artificial Intelligence · Computer Science 2026-02-23 Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

Are We on the Right Way to Assessing LLM-as-a-Judge?

LLM-as-a-Judge has been widely adopted as an evaluation method and has served as a supervised reward in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human…

Computation and Language · Computer Science 2025-12-19 Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen

CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics…

Software Engineering · Computer Science 2024-12-30 Martin Weyssow, Aton Kamanda, Xin Zhou, Houari Sahraoui

Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of having to decide if they should switch to a new model. While human evaluation remains the gold…

Artificial Intelligence · Computer Science 2025-12-25 Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai, Ben Leong

Bias in the Loop: Auditing LLM-as-a-Judge for Software Engineering

Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-judge is increasingly relevant in agentic software engineering workflows, where it…

Software Engineering · Computer Science 2026-04-21 Zixiao Zhao, Amirreza Esmaeili, Fatemeh Fard

Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to…

Computation and Language · Computer Science 2026-02-10 Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many…

Computation and Language · Computer Science 2025-08-19 Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming; yet rigorous evaluation in human-AI co-creation settings remains underdeveloped as judgments must be…

Software Engineering · Computer Science 2026-05-01 Md Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu, Haruto Suzuki, Kenta Nanaumi, Md Mostafizer Rahman

Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

Recent advances in reasoning-focused Large Language Models (LLMs) have introduced Chain-of-Thought (CoT) traces - intermediate reasoning steps generated before a final answer. These traces, as in DeepSeek R1, guide inference and train…

Computation and Language · Computer Science 2026-04-20 Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati

Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers…

Computation and Language · Computer Science 2026-01-06 Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung

Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation

Understanding a program's runtime reasoning behavior, meaning how intermediate states and control flows lead to final execution results, is essential for reliable code generation, debugging, and automated reasoning. Although large language…

Software Engineering · Computer Science 2025-12-02 Mohammad Abdollahi, Khandaker Rifah Tasnia, Soumit Kanti Saha, Jinqiu Yang, Song Wang, Hadi Hemmati

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences

Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and takes time given the large amounts of data. LLMs are increasingly…

Human-Computer Interaction · Computer Science 2025-08-07 Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Hyo Jin Do, Werner Geyer

Humans or LLMs as the Judge? A Study on Judgement Biases

Adopting humans and large language models (LLMs) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human…

Computation and Language · Computer Science 2024-09-27 Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang