Related papers: Clover: Closed-Loop Verifiable Code Generation

CLEVER: A Curated Benchmark for Formally Verified Code Generation

We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out…

Machine Learning · Computer Science · 2025-10-24 · Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri

Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions

Large Language Models (LLMs) have significantly advanced automated test generation, yet existing methods often rely on ground-truth code for verification, risking bug propagation and limiting applicability in test-driven development. We…

Software Engineering · Computer Science · 2026-02-12 · Hamed Taherkhani, Alireza DaghighFarsoodeh, Mohammad Chowdhury, Hung Viet Pham, Hadi Hemmati

CIFE: Code Instruction-Following Evaluation

Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment; developers also expect adherence to explicit requirements for robustness,…

Software Engineering · Computer Science · 2025-12-22 · Sravani Gunnu, Shanmukha Guttula, Hima Patel

IFEvalCode: Controlled Code Generation

Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed…

Computation and Language · Computer Science · 2025-08-04 · Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin

CLOVER: A Test Case Generation Benchmark with Coverage, Long-Context, and Verification

Software testing is a critical aspect of software development, yet generating test cases remains a routine task for engineers. This paper presents a benchmark, CLOVER, to evaluate models' capabilities in generating and completing test cases…

Software Engineering · Computer Science · 2025-02-14 · Jiacheng Xu, Bo Pang, Jin Qu, Hiroaki Hayashi, Caiming Xiong, Yingbo Zhou

Programming Language Confusion: When Code LLMs Can't Keep their Languages Straight

Large Language Models (LLMs) have achieved state-of-the-art performance across software engineering tasks, from code generation to translation. However, we identify and systematically evaluate a critical failure mode: Programming Language…

Software Engineering · Computer Science · 2026-02-03 · Micheline Bénédicte Moumoula, Serge Lionel Nikiema, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyandé

Evaluating LLM-Generated Code: A Benchmark and Developer Study

Code generation is one of the tasks for which the use of Large Language Models is widely adopted and highly successful. Given this popularity, there are many benchmarks dedicated to code generation that can help select the best model.…

Software Engineering · Computer Science · 2026-05-12 · Joanna Szych, Anne Schwerk

Combining LLM Code Generation with Formal Specifications and Reactive Program Synthesis

In the past few years, Large Language Models (LLMs) have exploded in usefulness and popularity for code generation tasks. However, LLMs still struggle with accuracy and are unsuitable for high-risk applications without additional oversight…

Software Engineering · Computer Science · 2024-10-29 · William Murphy, Nikolaus Holzer, Feitong Qiao, Leyi Cui, Raven Rothkopf, Nathan Koenig, Mark Santolucito

Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning

Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we…

Software Engineering · Computer Science · 2025-09-12 · Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou

Showing LLM-Generated Code Selectively Based on Confidence of LLMs

Large Language Models (LLMs) have shown impressive abilities in code generation, but they may generate erroneous programs. Reading a program takes ten times longer than writing it. Showing these erroneous programs to developers will waste…

Software Engineering · Computer Science · 2024-10-07 · Jia Li, Yuqi Zhu, Yongmin Li, Ge Li, Zhi Jin

Understanding Defects in Generated Codes by Language Models

This study investigates the reliability of code generation by Large Language Models (LLMs), focusing on identifying and analyzing defects in the generated code. Despite the advanced capabilities of LLMs in automating code generation,…

Software Engineering · Computer Science · 2024-08-27 · Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila

From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification

Large Language Models (LLMs) show promise in automated software engineering, yet their guarantee of correctness is frequently undermined by erroneous or hallucinated code. To enforce model honesty, formal verification requires LLMs to…

Software Engineering · Computer Science · 2026-04-27 · Md Erfan, Md Kamal Hossain Chowdhury, Ahmed Ryan, Md Rayhanur Rahman

Validating LLM-Generated Programs with Metamorphic Prompt Testing

The latest paradigm shift in software development brings in the innovation and automation afforded by Large Language Models (LLMs), showcased by Generative Pre-trained Transformer (GPT), which has shown remarkable capacity to generate code…

Software Engineering · Computer Science · 2024-06-12 · Xiaoyin Wang, Dakai Zhu

VERINA: Benchmarking Verifiable Code Generation

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating…

Machine Learning · Computer Science · 2026-03-18 · Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure…

Software Engineering · Computer Science · 2023-11-01 · Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

Rethinking the Evaluation of Secure Code Generation

Large language models (LLMs) are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current…

Cryptography and Security · Computer Science · 2025-11-14 · Shih-Chieh Dai, Jun Xu, Guanhong Tao

LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests

The usage of Large Language Models (LLMs) for software and test development has continued to increase since LLMs were first introduced, but only recently have the expectations of LLMs become more realistic. Verifying the correctness of code…

Software Engineering · Computer Science · 2025-08-20 · Zachariah Sollenberger, Rahul Patel, Saieda Ali Zada, Sunita Chandrasekaran

Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation

Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. Prior work has shown the potential of…

Software Engineering · Computer Science · 2026-03-04 · Zi Lin, Sheng Shen, Ilia Kulikov, Jingbo Shang, Jason Weston, Yixin Nie

Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation

Large Language Models (LLMs) have become powerful tools for automated code generation. However, these models often overlook critical security practices, which can result in the generation of insecure code that contains…

Software Engineering · Computer Science · 2025-07-01 · Hao Yan, Swapneel Suhas Vaidya, Xiaokuan Zhang, Ziyu Yao

LEVER: Learning to Verify Language-to-Code Generation with Execution

The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases…

Machine Learning · Computer Science · 2023-09-04 · Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin