Related papers: When Benchmarks Talk: Re-Evaluating Code LLMs with…

Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks

Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that…

Artificial Intelligence · Computer Science 2025-08-27 Dimitrios Rontogiannis, Maxime Peyrard, Nicolas Baldwin, Martin Josifoski, Robert West, Dimitrios Gunopulos

Is Your Benchmark (Still) Useful? Dynamic Benchmarking for Code Language Models

In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this…

Software Engineering · Computer Science 2025-03-11 Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li

The RealHumanEval: Evaluating Large Language Models' Abilities to Support Programmers

Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer…

Software Engineering · Computer Science 2024-10-16 Hussein Mozannar, Valerie Chen, Mohammed Alsobay, Subhro Das, Sebastian Zhao, Dennis Wei, Manish Nagireddy, Prasanna Sattigeri, Ameet Talwalkar, David Sontag

Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models

We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user↔agent interaction. The interaction is a conversation between the user and…

Computation and Language · Computer Science 2024-10-14 David Castillo-Bolado, Joseph Davidson, Finlay Gray, Marek Rosa

ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions,…

Software Engineering · Computer Science 2025-02-28 Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and…

Software Engineering · Computer Science 2025-11-18 Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue

Re-Evaluating Code LLM Benchmarks Under Semantic Mutation

In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…

Software Engineering · Computer Science 2025-06-24 Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

The Programmer's Assistant: Conversational Interaction with a Large Language Model for Software Development

Large language models (LLMs) have recently been applied in software engineering to perform tasks such as translating code between programming languages, generating code from natural language, and autocompleting code as it is being written.…

Human-Computer Interaction · Computer Science 2023-02-15 Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael Muller, Justin D. Weisz

Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation

Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly supervised signal-based, which depend on static datasets and…

Computation and Language · Computer Science 2023-09-11 Jiatong Li, Rui Li, Qi Liu

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul, Hong Zhu, Ian Bayley

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability is challenging, thereby hindering a clear understanding of their limitations. In this paper, we propose…

Computation and Language · Computer Science 2024-11-07 Chuyu Zhang, Songyang Zhang, Yingfan Hu, Haowen Shen, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai Chen

Conversational AI as a Coding Assistant: Understanding Programmers' Interactions with and Expectations from Large Language Models for Coding

Conversational AI interfaces powered by large language models (LLMs) are increasingly used as coding assistants. However, questions remain about how programmers interact with LLM-based conversational agents, the challenges they encounter,…

Human-Computer Interaction · Computer Science 2025-03-24 Mehmet Akhoroz, Caglar Yildirim

Teaching Code Refactoring Using LLMs

This Innovative Practice full paper explores how Large Language Models (LLMs) can enhance the teaching of code refactoring in software engineering courses through real-time, context-aware feedback. Refactoring improves code quality but is…

Software Engineering · Computer Science 2025-08-14 Anshul Khairnar, Aarya Rajoju, Edward F. Gehringer

RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the…

Computation and Language · Computer Science 2025-10-27 Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Yinghui Li, Hai-Tao Zheng, Xue Liu, Irwin King, Philip S. Yu

Understanding the Human-LLM Dynamic: A Literature Survey of LLM Use in Programming Tasks

Large Language Models (LLMs) are transforming programming practices, offering significant capabilities for code generation activities. While researchers have explored the potential of LLMs in various domains, this paper focuses on their use…

Software Engineering · Computer Science 2026-05-04 Deborah Etsenake, Meiyappan Nagappan

In-Place Feedback: Reliable Refinement for Multi-Turn Expert-LLM Collaboration

LLM-generated drafts often contain subtle factual or logical errors, yet prior work shows that models struggle to reliably integrate multi-turn feedback aimed at fixing them. We propose in-place feedback, an interaction paradigm in which…

Machine Learning · Computer Science 2026-05-29 Youngbin Choi, Minjong Lee, Saemi Moon, Seunghyuk Cho, Chaehyeon Chung, MoonJeong Park, Dongwoo Kim

Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach

Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLMs' program synthesis capability has generally relied on manually curated…

Software Engineering · Computer Science 2025-05-13 Longtian Wang, Tianlin Li, Xiaofei Xie, Yuhan Zhi, Jian Wang, Chao Shen

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to…

Computation and Language · Computer Science 2024-02-20 Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang

From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related…

Computation and Language · Computer Science 2026-04-17 Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov

Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing

Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring…

Human-Computer Interaction · Computer Science 2025-04-14 Jiho Kim, Philippe Laban, Xiang 'Anthony' Chen, Kenneth C. Arnold