Related papers: DOCE: Finding the Sweet Spot for Execution-Based C…

Revisit Self-Debugging with Self-Generated Tests for Code Generation

Large language models (LLMs) have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of…

Software Engineering · Computer Science 2025-01-23 Xiancai Chen, Zhengwei Tao, Kechi Zhang, Changzhi Zhou, Wanli Gu, Yuanpeng He, Mengdi Zhang, Xunliang Cai, Haiyan Zhao, Zhi Jin

Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code Candidates

Large Language Models (LLMs), such as GPT-4, StarCoder, and CodeLlama, are transforming the way developers approach programming by automatically generating code based on given natural language descriptions. Despite advancements, generating…

Software Engineering · Computer Science 2024-09-20 Zhihong Sun, Yao Wan, Jia Li, Hongyu Zhang, Zhi Jin, Ge Li, Chen Lyu

Self-Execution Simulation Improves Coding Models

A promising research direction in enabling LLMs to generate consistently correct code involves addressing their inability to properly estimate program execution, particularly for code they generate. In this work, we demonstrate that Code…

Computation and Language · Computer Science 2026-04-07 Gallil Maimon, Ori Yoran, Felix Kreuk, Michael Hassid, Gal Cohen, Pierre Chambon, Yossi Adi

CodeScore: Evaluating Code Generation by Learning Code Execution

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…

Software Engineering · Computer Science 2024-09-06 Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, Zhi Jin

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is…

Software Engineering · Computer Science 2024-10-04 Yiqing Xie, Alex Xie, Divyanshu Sheth, Pengfei Liu, Daniel Fried, Carolyn Rose

Natural Language to Code Translation with Execution

Generative models of code, pretrained on large corpora of programs, have shown great success in translating natural language to code (Chen et al., 2021; Austin et al., 2021; Li et al., 2022, inter alia). While these models do not explicitly…

Computation and Language · Computer Science 2022-11-02 Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, Sida I. Wang

Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

Decompilation -- recovering source code from compiled binaries -- is essential for security analysis, malware reverse engineering, and legacy software maintenance. However, existing decompilers produce code that often fails to compile or…

Software Engineering · Computer Science 2026-05-05 Yifan Zhang, Xiaohan Wang, Yueke Zhang, Yu Huang, Kevin Leach

LEVER: Learning to Verify Language-to-Code Generation with Execution

The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases…

Machine Learning · Computer Science 2023-09-04 Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin

Enhancing LLM Code Generation: A Systematic Evaluation of Multi-Agent Collaboration and Runtime Debugging for Improved Accuracy, Reliability, and Latency

The use of large language models (LLMs) for automated code generation has emerged as a significant focus within AI research. As these pretrained models continue to evolve, their ability to understand and generate complex code structures has…

Software Engineering · Computer Science 2025-05-06 Nazmus Ashrafi, Salah Bouktif, Mohammed Mediani

LeDex: Training LLMs to Better Self-Debug and Explain Code

In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging…

Computation and Language · Computer Science 2025-02-17 Nan Jiang, Xiaopeng Li, Shiqi Wang, Qiang Zhou, Soneya Binta Hossain, Baishakhi Ray, Varun Kumar, Xiaofei Ma, Anoop Deoras

A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models

Code Executing Reasoning is becoming a new non-functional metric that assesses the ability of large language models (LLMs) in programming tasks. State-of-the-art frameworks (CodeMind or REval) and benchmarks (CruxEval) usually focus on…

Software Engineering · Computer Science 2025-01-31 Changshu Liu, Reyhaneh Jabbarvand

Designing Empirical Studies on LLM-Based Code Generation: Towards a Reference Framework

The rise of large language models (LLMs) has introduced transformative potential in automated code generation, addressing a wide range of software engineering challenges. However, empirical evaluation of LLM-based code generation lacks…

Software Engineering · Computer Science 2025-10-07 Nathalia Nascimento, Everton Guimaraes, Paulo Alencar

Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs.…

Software Engineering · Computer Science 2024-06-12 Li Zhong, Zilong Wang, Jingbo Shang

Execution Guided Line-by-Line Code Generation

We present a novel approach to neural code generation that incorporates real-time execution signals into the language model generation process. While large language models (LLMs) have demonstrated impressive code generation capabilities,…

Machine Learning · Computer Science 2025-10-24 Boaz Lavon, Shahar Katz, Lior Wolf

ExeDec: Execution Decomposition for Compositional Generalization in Neural Program Synthesis

When writing programs, people have the ability to tackle a new complex task by decomposing it into smaller and more familiar subtasks. While it is difficult to measure whether neural program synthesis methods have similar capabilities, we…

Machine Learning · Computer Science 2024-05-07 Kensen Shi, Joey Hong, Yinlin Deng, Pengcheng Yin, Manzil Zaheer, Charles Sutton

MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level…

Machine Learning · Computer Science 2026-01-12 Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang, Huzefa Rangwala, George Karypis

Context-Guided Decompilation: A Step Towards Re-executability

Binary decompilation plays an important role in software security analysis, reverse engineering, and malware understanding when source code is unavailable. However, existing decompilation techniques often fail to produce source code that…

Software Engineering · Computer Science 2026-04-14 Xiaohan Wang, Yuxin Hu, Kevin Leach

Execution-Based Evaluation of Natural Language to Bash and PowerShell for Incident Remediation

Given recent advancements of Large Language Models (LLMs), code generation tasks attract immense attention for wide application in different domains. In an effort to evaluate and select a best model to automatically remediate system…

Computation and Language · Computer Science 2024-12-18 Ngoc Phuoc An Vo, Brent Paulovicks, Vadim Sheinin

ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?

Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to produce efficient solutions while ensuring correctness remains a challenge. Further, unreliability in…

Computation and Language · Computer Science 2024-10-11 Siddhant Waghjale, Vishruth Veerendranath, Zora Zhiruo Wang, Daniel Fried

DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode

This work addresses test output prediction, a key challenge in test case generation. To improve the reliability of predicted outputs by LLMs, prior approaches generate code first to ground predictions. One grounding strategy is direct…

Software Engineering · Computer Science 2026-04-14 Hojae Han, Jaejin Kim, Seung-won Hwang, Yu Jin Kim, Moontae Lee