Related papers: Comparing Developer and LLM Biases in Code Evaluat…

200 papers

Learning analytics researchers often analyze qualitative student data such as coded annotations or interview transcripts to understand learning processes. With the rise of generative AI, fully automated and human-AI workflows have emerged…

Computation and Language · Computer Science 2026-01-21 Elham Tajik, Conrad Borchers, Bahar Shahrokhian, Sebastian Simon, Ali Keramati, Sonika Pal, Sreecharan Sankaranarayanan

Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make…

Computation and Language · Computer Science 2026-05-12 Aakash Sen Sharma, Debdeep Sanyal, Manodeep Ray, Vivek Srivastava, Shirish Karande, Murari Mandal

LLM-based software engineering assistants fail not only by producing incorrect outputs, but also by allocating trust to the wrong artifact when code, documentation, and tests disagree. Existing evaluations focus mainly on downstream…

Software Engineering · Computer Science 2026-04-07 Noshin Ulfat, Ahsanul Ameen Sabit, Soneya Binta Hossain

Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. However, the continual learning aspect of these aligned LLMs has been largely overlooked. Existing…

Computation and Language · Computer Science 2023-10-11 Xiao Wang, Yuansen Zhang, Tianze Chen, Songyang Gao, Senjie Jin, Xianjun Yang, Zhiheng Xi, Rui Zheng, Yicheng Zou, Tao Gui, Qi Zhang, Xuanjing Huang

Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself…

Artificial Intelligence · Computer Science 2026-05-29 Yundong Kim, Heyoung Yang

Using large language models (LLMs) to annotate relevance is an increasingly important technique in the information retrieval community. While some studies demonstrate that LLMs can achieve high user agreement with ground truth (human)…

Information Retrieval · Computer Science 2026-01-15 Watheq Mansour, J. Shane Culpepper, Joel Mackenzie, Andrew Yates

Large Language Model (LLM) judges exhibit strong reasoning capabilities but are limited to textual content. This leaves current automatic Speech-to-Speech (S2S) evaluation methods reliant on opaque and expensive Audio Language Models…

Computation and Language · Computer Science 2026-01-27 Arjun Chandra, Kevin Miller, Venkatesh Ravichandran, Constantinos Papayiannis, Venkatesh Saligrama

Large language models (LLMs) are increasingly used as raters for evaluation tasks. However, their reliability is often limited for subjective tasks, when human judgments involve subtle reasoning beyond annotation labels. Thinking traces,…

Artificial Intelligence · Computer Science 2026-02-23 Xingjian Zhang, Tianhong Gao, Suliang Jin, Tianhao Wang, Teng Ye, Eytan Adar, Qiaozhu Mei

LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised rewards in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human…

Computation and Language · Computer Science 2025-12-19 Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen

Evaluating the alignment of large language models (LLMs) with user-defined coding preferences is a challenging endeavour that requires a deep assessment of LLMs' outputs. Existing methods and benchmarks rely primarily on automated metrics…

Software Engineering · Computer Science 2024-12-30 Martin Weyssow, Aton Kamanda, Xin Zhou, Houari Sahraoui

New Large Language Models (LLMs) become available every few weeks, and modern application developers are confronted with the unenviable task of having to decide whether they should switch to a new model. While human evaluation remains the gold…

Artificial Intelligence · Computer Science 2025-12-25 Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai, Ben Leong

Large Language Models are increasingly used as judges to evaluate code artifacts when exhaustive human review or executable test coverage is unavailable. LLM-judge is increasingly relevant in agentic software engineering workflows, where it…

Software Engineering · Computer Science 2026-04-21 Zixiao Zhao, Amirreza Esmaeili, Fatemeh Fard

Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to…

Computation and Language · Computer Science 2026-02-10 Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi

Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many…

Computation and Language · Computer Science 2025-08-19 Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes

LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming; yet rigorous evaluation in human-AI co-creation settings remains underdeveloped as judgments must be…

Software Engineering · Computer Science 2026-05-01 Md Faizul Ibne Amin, Yutaka Watanobe, Daniel M. Muepu, Haruto Suzuki, Kenta Nanaumi, Md Mostafizer Rahman

Recent advances in reasoning-focused Large Language Models (LLMs) have introduced Chain-of-Thought (CoT) traces - intermediate reasoning steps generated before a final answer. These traces, as in DeepSeek R1, guide inference and train…

Computation and Language · Computer Science 2026-04-20 Siddhant Bhambri, Upasana Biswas, Subbarao Kambhampati

With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers…

Computation and Language · Computer Science 2026-01-06 Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung

Understanding a program's runtime reasoning behavior, meaning how intermediate states and control flows lead to final execution results, is essential for reliable code generation, debugging, and automated reasoning. Although large language…

Software Engineering · Computer Science 2025-12-02 Mohammad Abdollahi, Khandaker Rifah Tasnia, Soumit Kanti Saha, Jinqiu Yang, Song Wang, Hadi Hemmati

Evaluation of large language model (LLM) outputs requires users to make critical judgments about the best outputs across various configurations. This process is costly and takes time given the large amounts of data. LLMs are increasingly…

Adopting humans and large language models (LLMs) as judges (a.k.a. human- and LLM-as-a-judge) for evaluating the performance of LLMs has recently gained attention. Nonetheless, this approach concurrently introduces potential biases from human…

Computation and Language · Computer Science 2024-09-27 Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, Benyou Wang