Related papers: ReCode: Robustness Evaluation of Code Generation M…

ReCode: Reinforcing Code Generation with Reasoning-Process Rewards

In practice, rigorous reasoning is often a key driver of correct code, while Reinforcement Learning (RL) for code generation often neglects optimizing reasoning quality. Bringing process-level supervision into RL is appealing, but it faces…

Software Engineering · Computer Science 2026-05-06 Lishui Fan, Yu Zhang, Mouxiang Chen, Zhongxin Liu

A Multi-Language Perspective on the Robustness of LLM Code Generation

Large language models have gained significant traction and popularity in recent times, extending their usage to code-generation tasks. While this field has garnered considerable attention, the exploration of testing and evaluating the…

Software Engineering · Computer Science 2026-05-05 Fazle Rabbi, Zishuo Ding, Jinqiu Yang

When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

Large language models are increasingly used for code generation, yet the correctness of their outputs depends not only on model capability but also on how tasks are specified. Prior studies demonstrate that small changes in natural language…

Software Engineering · Computer Science 2026-04-28 Amal AKLI, Mike PAPADAKIS, Maxime CORDY, Yves Le TRAON

CodeFort: Robust Training for Code Generation Models

Code generation models are not robust to small perturbations, which often lead to incorrect generations and significantly degrade the performance of these models. Although improving the robustness of code generation models is crucial to…

Software Engineering · Computer Science 2024-10-30 Yuhao Zhang, Shiqi Wang, Haifeng Qian, Zijian Wang, Mingyue Shang, Linbo Liu, Sanjay Krishna Gouda, Baishakhi Ray, Murali Krishna Ramanathan, Xiaofei Ma, Anoop Deoras

COCO: Testing Code Generation Systems via Concretized Instructions

Code generation systems have been extensively developed in recent years to generate source code based on natural language instructions. However, despite their advancements, these systems still face robustness issues where even slightly…

Software Engineering · Computer Science 2023-08-28 Ming Yan, Junjie Chen, Jie M. Zhang, Xuejie Cao, Chen Yang, Mark Harman

Re-Evaluating Code LLM Benchmarks Under Semantic Mutation

In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…

Software Engineering · Computer Science 2025-06-24 Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

A Preliminary Study on the Robustness of Code Generation by Large Language Models

Robustness is a critical factor for reliable code generation by large language models, yet most evaluations focus on correctness and overlook key issues such as missing input validation and inadequate error handling. In this work, we…

Software Engineering · Computer Science 2025-09-24 Zike Li, Mingwei Liu, Anji Li, Kaifeng He, Yanlin Wang, Xin Peng, Zibin Zheng

On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex

Semantic parsing is a technique aimed at constructing a structured representation of the meaning of a natural-language question. Recent advancements in few-shot language models trained on code have demonstrated superior performance in…

Computation and Language · Computer Science 2023-03-10 Terry Yue Zhuo, Zhuang Li, Yujin Huang, Fatemeh Shiri, Weiqing Wang, Gholamreza Haffari, Yuan-Fang Li

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address…

Software Engineering · Computer Science 2025-01-08 Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, Xiang Yue

ReCode: Updating Code API Knowledge with Reinforcement Learning

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their…

Computation and Language · Computer Science 2025-11-25 Haoze Wu, Yunzhi Yao, Wenhao Yu, Ningyu Zhang

RepoMasterEval: Evaluating Code Completion via Real-World Repositories

With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by…

Software Engineering · Computer Science 2025-11-03 Qinyun Wu, Chao Peng, Pengfei Gao, Ruida Hu, Haoyu Gan, Bo Jiang, Jinhe Tang, Zhiwen Deng, Zhanming Guan, Cuiyun Gao, Xia Liu, Ping Yang

An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities

Code generation aims to automatically generate code snippets in a specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…

Software Engineering · Computer Science 2025-01-24 Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, Xin Xia

On the Robustness of Code Generation Techniques: An Empirical Study on GitHub Copilot

Software engineering research has always been concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step…

Software Engineering · Computer Science 2023-02-02 Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, Gabriele Bavota

On the Reliability and Explainability of Language Models for Program Generation

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and…

Software Engineering · Computer Science 2024-01-09 Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Li Li

CodeT: Code Generation with Generated Tests

The task of generating code solutions for a given programming problem can benefit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select…

Computation and Language · Computer Science 2022-11-24 Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, Weizhu Chen

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They…

Software Engineering · Computer Science 2025-01-03 Zhaojian Yu, Yilun Zhao, Arman Cohan, Xiao-Ping Zhang

Evaluating How Fine-tuning on Bimodal Data Effects Code Generation

Despite the increase in popularity of language models for code generation, it is still unknown how training on bimodal coding forums affects a model's code generation performance and reliability. We, therefore, collect a dataset of over…

Machine Learning · Computer Science 2022-11-16 Gabriel Orlanski, Seonhye Yang, Michael Healy

NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations

Large language models (LLMs) achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The…

Software Engineering · Computer Science 2024-07-01 Junkai Chen, Zhenhao Li, Xing Hu, Xin Xia

When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions

Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity,…

Software Engineering · Computer Science 2025-07-29 Maya Larbi, Amal Akli, Mike Papadakis, Rihab Bouyousfi, Maxime Cordy, Federica Sarro, Yves Le Traon

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan