Related papers: Measuring Coding Challenge Competence With APPS

The Fault in our Stars: Quality Assessment of Code Generation Benchmarks

Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can…

Software Engineering · Computer Science 2024-09-05 Mohammed Latif Siddiq , Simantika Dristi , Joy Saha , Joanna C. S. Santos

Proving the Coding Interview: A Benchmark for Formally Verified Code Generation

We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and…

Software Engineering · Computer Science 2025-02-11 Quinn Dougherty , Ronak Mehta

COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models

Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility's Multi-dimensional…

Software Engineering · Computer Science 2025-08-20 James Meaden , Michał Jarosz , Piotr Jodłowski , Grigori Melnik

Competition-Level Code Generation with AlphaCode

Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating…

Programming Languages · Computer Science 2023-01-11 Yujia Li , David Choi , Junyoung Chung , Nate Kushman , Julian Schrittwieser , Rémi Leblond , Tom Eccles , James Keeling , Felix Gimeno , Agustin Dal Lago , Thomas Hubert , Peter Choy , Cyprien de Masson d'Autume , Igor Babuschkin , Xinyun Chen , Po-Sen Huang , Johannes Welbl , Sven Gowal , Alexey Cherepanov , James Molloy , Daniel J. Mankowitz , Esme Sutherland Robson , Pushmeet Kohli , Nando de Freitas , Koray Kavukcuoglu , Oriol Vinyals

CodeGen-Test: An Automatic Code Generation Model Integrating Program Test Information

Automatic code generation is to generate the program code according to the given natural language description. The current mainstream approach uses neural networks to encode natural language descriptions, and output abstract syntax trees…

Software Engineering · Computer Science 2022-02-16 Maosheng Zhong , Gen Liu , Hongwei Li , Jiangling Kuang , Jinshan Zeng , Mingwen Wang

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

Insights from Benchmarking Frontier Language Models on Web App Code Generation

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models…

Software Engineering · Computer Science 2024-09-10 Yi Cui

"I Would Have Written My Code Differently'': Beginners Struggle to Understand LLM-Generated Code

Large language models (LLMs) are being increasingly adopted for programming work. Prior work shows that while LLMs accelerate task completion for professional programmers, beginning programmers struggle to prompt models effectively.…

Software Engineering · Computer Science 2025-04-29 Yangtian Zi , Luisa Li , Arjun Guha , Carolyn Jane Anderson , Molly Q Feldman

On the Reliability and Explainability of Language Models for Program Generation

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and…

Software Engineering · Computer Science 2024-01-09 Yue Liu , Chakkrit Tantithamthavorn , Yonghui Liu , Li Li

GenX: Mastering Code and Test Generation with Execution Feedback

Recent advancements in language modeling have enabled the translation of natural language into code, and the use of execution feedback to improve code generation. However, these methods often rely heavily on pre-existing test cases, which…

Software Engineering · Computer Science 2024-12-19 Nan Wang , Yafei Liu , Chen Chen , Haonan Lu

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh , Beiyu Lin

Automatic Code Generation using Pre-Trained Language Models

Recent advancements in natural language processing \cite{gpt2} \cite{BERT} have led to near-human performance in multiple natural language tasks. In this paper, we seek to understand whether similar techniques can be applied to a highly…

Computation and Language · Computer Science 2021-02-23 Luis Perez , Lizi Ottens , Sudharshan Viswanathan

Automated Test Input Generation for Android: Are We There Yet?

Mobile applications, often simply called "apps", are increasingly widespread, and we use them daily to perform a number of activities. Like all software, apps must be adequately tested to gain confidence that they behave correctly.…

Software Engineering · Computer Science 2015-04-01 Shauvik Roy Choudhary , Alessandra Gorla , Alessandro Orso

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and…

Software Engineering · Computer Science 2025-10-20 Shihan Dou , Haoxiang Jia , Shenxi Wu , Huiyuan Zheng , Muling Wu , Yunbo Tao , Ming Zhang , Mingxu Chai , Jessica Fan , Zhiheng Xi , Rui Zheng , Yueming Wu , Ming Wen , Tao Gui , Qi Zhang , Xipeng Qiu , Xuanjing Huang

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems…

Software Engineering · Computer Science 2022-11-22 Yuhang Lai , Chengxi Li , Yiming Wang , Tianyi Zhang , Ruiqi Zhong , Luke Zettlemoyer , Scott Wen-tau Yih , Daniel Fried , Sida Wang , Tao Yu

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks-such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena-have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang , Yiran Hu , Nan Jiang , Lin Tan

Language Models Can Teach Themselves to Program Better

Recent Language Models (LMs) achieve breakthrough performance in code generation when trained on human-authored problems, even solving some competitive-programming problems. Self-play has proven useful in games such as Go, and thus it is…

Machine Learning · Computer Science 2023-04-13 Patrick Haluptzok , Matthew Bowers , Adam Tauman Kalai

Where Do LLMs Still Struggle? An In-Depth Analysis of Code Generation Benchmarks

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering…

Software Engineering · Computer Science 2025-11-07 Amir Molzam Sharifloo , Maedeh Heydari , Parsa Kazerooni , Daniel Maninger , Mira Mezini

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen , Qi Guo , Hongrui Jia , Zhengran Zeng , Xin Wang , Yijiang Xu , Jian Wu , Yidong Wang , Qing Gao , Jindong Wang , Wei Ye , Shikun Zhang

WebApp1K: A Practical Code-Generation Benchmark for Web App Development

We introduce WebApp1K, a practical code-generation benchmark to measure LLM ability to develop web apps. This benchmark aims to calibrate LLM output and aid the models to progressively improve code correctness and functionality. The…

Software Engineering · Computer Science 2024-08-02 Yi Cui