Related papers: WybeCoder: Verified Imperative Code Generation

VERINA: Benchmarking Verifiable Code Generation

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating…

Machine Learning · Computer Science 2026-03-18 Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation

Large language models can generate useful code from natural language, but their outputs come without correctness guarantees. Verifiable code generation offers a path beyond testing by requiring models to produce not only executable code,…

Software Engineering · Computer Science 2026-05-12 Zichen Xie, Mrigank Pawagi, Yuxin Liu, Aaditi Rai, Lize Shao, John Berberian, Sicong Che, Wenxi Wang

Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently…

Artificial Intelligence · Computer Science 2025-07-31 Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Shrinking the Generation-Verification Gap with Weak Verifiers

Verifiers can improve language model capabilities by scoring and ranking responses from generated candidates. Currently, high-quality verifiers are either unscalable (e.g., humans) or limited in utility (e.g., tools like Lean). While LM…

Computation and Language · Computer Science 2025-12-10 Jon Saad-Falcon, E. Kelly Buchanan, Mayee F. Chen, Tzu-Heng Huang, Brendan McLaughlin, Tanvir Bhathal, Shang Zhu, Ben Athiwaratkun, Frederic Sala, Scott Linderman, Azalia Mirhoseini, Christopher Ré

EffiCoder: Enhancing Code Generation in Large Language Models through Efficiency-Aware Fine-tuning

As large language models (LLMs) play an increasingly important role in code generation, enhancing both correctness and efficiency has become crucial. Current methods primarily focus on correctness, often overlooking efficiency. To address…

Computation and Language · Computer Science 2025-06-17 Dong Huang, Guangtao Zeng, Jianbo Dai, Meng Luo, Han Weng, Yuhao Qing, Heming Cui, Zhijiang Guo, Jie M. Zhang

Scaling Agentic Verifier for Competitive Coding

Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy,…

Computation and Language · Computer Science 2026-02-05 Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui

VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

Formal verification is the next frontier for ensuring the correctness of code generated by Large Language Models (LLMs). While methods that co-generate code and formal specifications in formal languages, like Dafny, can, in principle, prove…

Programming Languages · Computer Science 2026-04-21 Lingfei Zeng, Fengdi Che, Xuhan Huang, Fei Ye, Xu Xu, Binhang Yuan, Jie Fu

Automating Formal Verification with Agent-Guided Tree Search

Formal verification offers a path to provably correct software, but writing verified code remains expensive enough that the technique is rarely used in production. Recent large language models can accelerate this work, and recent benchmarks…

Logic in Computer Science · Computer Science 2026-05-28 Leo Yao

VisCoder2: Building Multi-Language Visualization Coding Agents

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable…

Software Engineering · Computer Science 2026-04-09 Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

AlphaVerus: Bootstrapping Formally Verified Code Generation through Self-Improving Translation and Treefinement

Automated code generation with large language models has gained significant traction, but there remains no guarantee on the correctness of generated code. We aim to use formal verification to provide mathematical guarantees that the…

Machine Learning · Computer Science 2024-12-10 Pranjal Aggarwal, Bryan Parno, Sean Welleck

CLEVER: A Curated Benchmark for Formally Verified Code Generation

We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out…

Machine Learning · Computer Science 2025-10-24 Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science 2025-02-18 Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee on the trustworthiness of the generated programs. To improve…

Software Engineering · Computer Science 2025-10-08 Xun Deng, Sicheng Zhong, Barış Bayazıt, Andreas Veneris, Fan Long, Xujie Si

Goedel-Code-Prover: Hierarchical Proof Search for Open State-of-the-Art Code Verification

Large language models (LLMs) can generate plausible code but offer limited guarantees of correctness. Formally verifying that implementations satisfy specifications requires constructing machine-checkable proofs, a task that remains beyond…

Software Engineering · Computer Science 2026-03-30 Zenan Li, Ziran Yang, Deyuan He, Haoyu Zhao, Andrew Zhao, Shange Tang, Kaiyu Yang, Aarti Gupta, Zhendong Su, Chi Jin

IFEvalCode: Controlled Code Generation

Code large language models (Code LLMs) have made significant progress in code generation by translating natural language descriptions into functional code; however, real-world applications often demand stricter adherence to detailed…

Computation and Language · Computer Science 2025-08-04 Jian Yang, Wei Zhang, Shukai Liu, Linzheng Chai, Yingshui Tan, Jiaheng Liu, Ge Zhang, Wangchunshu Zhou, Guanglin Niu, Zhoujun Li, Binyuan Hui, Junyang Lin

Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation

Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. Prior work has shown the potential of…

Software Engineering · Computer Science 2026-03-04 Zi Lin, Sheng Shen, Ilia Kulikov, Jingbo Shang, Jason Weston, Yixin Nie

Verifiable Reasoning for LLM-based Generative Recommendation

Reasoning in Large Language Models (LLMs) has recently shown strong potential in enhancing generative recommendation through deep understanding of complex user preference. Existing approaches follow a reason-then-recommend paradigm, where…

Information Retrieval · Computer Science 2026-03-10 Xinyu Lin, Hanqing Zeng, Hanchao Yu, Yinglong Xia, Jiang Zhang, Aashu Singh, Fei Liu, Wenjie Wang, Fuli Feng, Tat-Seng Chua, Qifan Wang

Klear-CodeTest: Scalable Test Case Generation for Code Reinforcement Learning

Precise, correct feedback is crucial for effectively training large language models (LLMs) in code reinforcement learning. However, synthesizing high-quality test cases remains a profoundly challenging and unsolved problem. In this work, we…

Software Engineering · Computer Science 2025-09-12 Jia Fu, Xinyu Yang, Hongzhi Zhang, Yahui Liu, Jingyuan Zhang, Qi Wang, Fuzheng Zhang, Guorui Zhou

VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational…

Computation and Language · Computer Science 2025-08-22 Jiuzhou Han, Wray Buntine, Ehsan Shareghi

LEVER: Learning to Verify Language-to-Code Generation with Execution

The advent of large language models trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases…

Machine Learning · Computer Science 2023-09-04 Ansong Ni, Srini Iyer, Dragomir Radev, Ves Stoyanov, Wen-tau Yih, Sida I. Wang, Xi Victoria Lin