Related papers: ReCode: Robustness Evaluation of Code Generation M…
In practice, rigorous reasoning is often a key driver of correct code, while Reinforcement Learning (RL) for code generation often neglects optimizing reasoning quality. Bringing process-level supervision into RL is appealing, but it faces…
Large language models have gained significant traction and popularity in recent times, extending their usage to code-generation tasks. While this field has garnered considerable attention, the exploration of testing and evaluating the…
Large language models are increasingly used for code generation, yet the correctness of their outputs depends not only on model capability but also on how tasks are specified. Prior studies demonstrate that small changes in natural language…
Code generation models are not robust to small perturbations, which often lead to incorrect generations and significantly degrade the performance of these models. Although improving the robustness of code generation models is crucial to…
Code generation systems have been extensively developed in recent years to generate source code based on natural language instructions. However, despite their advancements, these systems still face robustness issues where even slightly…
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…
Robustness is a critical factor for reliable code generation by large language models, yet most evaluations focus on correctness and overlook key issues such as missing input validation and inadequate error handling. In this work, we…
Semantic parsing is a technique aimed at constructing a structured representation of the meaning of a natural-language question. Recent advancements in few-shot language models trained on code have demonstrated superior performance in…
The introduction of large language models has significantly advanced code generation. However, open-source models often lack the execution capabilities and iterative refinement of advanced systems like the GPT-4 Code Interpreter. To address…
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their…
With the growing reliance on automated code completion tools in software development, the need for comprehensive evaluation benchmarks has become critical. Existing benchmarks focus more on code completion in function and class level by…
Code generation aims to automatically generate code snippets of specific programming language according to natural language descriptions. The continuous advancements in deep learning, particularly pre-trained models, have empowered the code…
Software engineering research has always being concerned with the improvement of code completion approaches, which suggest the next tokens a developer will likely type while coding. The release of GitHub Copilot constitutes a big step…
Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and…
The task of generating code solutions for a given programming problem can benefit from the use of pre-trained language models such as Codex, which can produce multiple diverse samples. However, a major challenge for this task is to select…
We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They…
Despite the increase in popularity of language models for code generation, it is still unknown how training on bimodal coding forums affects a model's code generation performance and reliability. We, therefore, collect a dataset of over…
Large language models (LLMs) achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The…
Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity,…
Recently, a number of repository-level code generation benchmarks-such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena-have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…