Related papers: When Benchmarks Talk: Re-Evaluating Code LLMs with…

200 papers

Standard single-turn, static benchmarks fall short in evaluating the nuanced capabilities of Large Language Models (LLMs) on complex tasks such as software engineering. In this work, we propose a novel interactive evaluation framework that…

Artificial Intelligence · Computer Science 2025-08-27 Dimitrios Rontogiannis, Maxime Peyrard, Nicolas Baldwin, Martin Josifoski, Robert West, Dimitrios Gunopulos

In this paper, we tackle a critical challenge in model evaluation: how to keep code benchmarks useful when models might have already seen them during training. We introduce a novel solution, a dynamic benchmarking framework, to address this…

Software Engineering · Computer Science 2025-03-11 Batu Guan, Xiao Wu, Yuanyuan Yuan, Shaohua Li

Evaluation of large language models for code has primarily relied on static benchmarks, including HumanEval (Chen et al., 2021), or more recently using human preferences of LLM responses. As LLMs are increasingly used as programmer…

We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user↔agent interaction. The interaction is a conversation between the user and…

Computation and Language · Computer Science 2024-10-14 David Castillo-Bolado, Joseph Davidson, Finlay Gray, Marek Rosa

Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions,…

Software Engineering · Computer Science 2025-02-28 Hojae Han, Seung-won Hwang, Rajhans Samdani, Yuxiong He

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and…

In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…

Software Engineering · Computer Science 2025-06-24 Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

Large language models (LLMs) have recently been applied in software engineering to perform tasks such as translating code between programming languages, generating code from natural language, and autocompleting code as it is being written.…

Human-Computer Interaction · Computer Science 2023-02-15 Steven I. Ross, Fernando Martinez, Stephanie Houde, Michael Muller, Justin D. Weisz

Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly based on supervised signals, which depend on static datasets and…

Computation and Language · Computer Science 2023-09-11 Jiatong Li, Rui Li, Qi Liu

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul, Hong Zhu, Ian Bayley

While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability is challenging, thereby hindering a clear understanding of their limitations. In this paper, we propose…

Computation and Language · Computer Science 2024-11-07 Chuyu Zhang, Songyang Zhang, Yingfan Hu, Haowen Shen, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai Chen

Conversational AI interfaces powered by large language models (LLMs) are increasingly used as coding assistants. However, questions remain about how programmers interact with LLM-based conversational agents, the challenges they encounter,…

Human-Computer Interaction · Computer Science 2025-03-24 Mehmet Akhoroz, Caglar Yildirim

This Innovative Practice full paper explores how Large Language Models (LLMs) can enhance the teaching of code refactoring in software engineering courses through real-time, context-aware feedback. Refactoring improves code quality but is…

Software Engineering · Computer Science 2025-08-14 Anshul Khairnar, Aarya Rajoju, Edward F. Gehringer

Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the…

Large Language Models (LLMs) are transforming programming practices, offering significant capabilities for code generation activities. While researchers have explored the potential of LLMs in various domains, this paper focuses on their use…

Software Engineering · Computer Science 2026-05-04 Deborah Etsenake, Meiyappan Nagappan

LLM-generated drafts often contain subtle factual or logical errors, yet prior work shows that models struggle to reliably integrate multi-turn feedback aimed at fixing them. We propose in-place feedback, an interaction paradigm in which…

Machine Learning · Computer Science 2026-05-29 Youngbin Choi, Minjong Lee, Saemi Moon, Seunghyuk Cho, Chaehyeon Chung, MoonJeong Park, Dongwoo Kim

Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLMs' program synthesis capability has generally relied on manually curated…

Software Engineering · Computer Science 2025-05-13 Longtian Wang, Tianlin Li, Xiaofei Xie, Yuhan Zhi, Jian Wang, Chao Shen

This paper presents a benchmark self-evolving framework to dynamically evaluate rapidly advancing Large Language Models (LLMs), aiming for a more accurate assessment of their capabilities and limitations. We utilize a multi-agent system to…

Computation and Language · Computer Science 2024-02-20 Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei, Xuanjing Huang

Evaluating LLMs is challenging, as benchmark scores often fail to capture models' real-world usefulness. Instead, users often rely on "vibe-testing": informal experience-based evaluation, such as comparing models on coding tasks related…

Computation and Language · Computer Science 2026-04-17 Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov

Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring…

Human-Computer Interaction · Computer Science 2025-04-14 Jiho Kim, Philippe Laban, Xiang 'Anthony' Chen, Kenneth C. Arnold