English
Related papers

Related papers: Measuring Coding Challenge Competence With APPS

200 papers

Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can…

Software Engineering · Computer Science 2024-09-05 Mohammed Latif Siddiq , Simantika Dristi , Joy Saha , Joanna C. S. Santos

We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark, including 1083 curated and…

Software Engineering · Computer Science 2025-02-11 Quinn Dougherty , Ronak Mehta

Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility's Multi-dimensional…

Software Engineering · Computer Science 2025-08-20 James Meaden , Michał Jarosz , Piotr Jodłowski , Grigori Melnik

Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating…

Automatic code generation is to generate the program code according to the given natural language description. The current mainstream approach uses neural networks to encode natural language descriptions, and output abstract syntax trees…

Software Engineering · Computer Science 2022-02-16 Maosheng Zhong , Gen Liu , Hongwei Li , Jiangling Kuang , Jinshan Zeng , Mingwen Wang

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models…

Software Engineering · Computer Science 2024-09-10 Yi Cui

Large language models (LLMs) are being increasingly adopted for programming work. Prior work shows that while LLMs accelerate task completion for professional programmers, beginning programmers struggle to prompt models effectively.…

Software Engineering · Computer Science 2025-04-29 Yangtian Zi , Luisa Li , Arjun Guha , Carolyn Jane Anderson , Molly Q Feldman

Recent studies have adopted pre-trained language models, such as CodeT5 and CodeGPT, for automated program generation tasks like code generation, repair, and translation. Numerous language model-based approaches have been proposed and…

Software Engineering · Computer Science 2024-01-09 Yue Liu , Chakkrit Tantithamthavorn , Yonghui Liu , Li Li

Recent advancements in language modeling have enabled the translation of natural language into code, and the use of execution feedback to improve code generation. However, these methods often rely heavily on pre-existing test cases, which…

Software Engineering · Computer Science 2024-12-19 Nan Wang , Yafei Liu , Chen Chen , Haonan Lu

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh , Beiyu Lin

Recent advancements in natural language processing \cite{gpt2} \cite{BERT} have led to near-human performance in multiple natural language tasks. In this paper, we seek to understand whether similar techniques can be applied to a highly…

Computation and Language · Computer Science 2021-02-23 Luis Perez , Lizi Ottens , Sudharshan Viswanathan

Mobile applications, often simply called "apps", are increasingly widespread, and we use them daily to perform a number of activities. Like all software, apps must be adequately tested to gain confidence that they behave correctly.…

Software Engineering · Computer Science 2015-04-01 Shauvik Roy Choudhary , Alessandra Gorla , Alessandro Orso

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and…

We introduce DS-1000, a code generation benchmark with a thousand data science problems spanning seven Python libraries, such as NumPy and Pandas. Compared to prior works, DS-1000 incorporates three core features. First, our problems…

Software Engineering · Computer Science 2022-11-22 Yuhang Lai , Chengxi Li , Yiming Wang , Tianyi Zhang , Ruiqi Zhong , Luke Zettlemoyer , Scott Wen-tau Yih , Daniel Fried , Sida Wang , Tao Yu

Recently, a number of repository-level code generation benchmarks-such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena-have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang , Yiran Hu , Nan Jiang , Lin Tan

Recent Language Models (LMs) achieve breakthrough performance in code generation when trained on human-authored problems, even solving some competitive-programming problems. Self-play has proven useful in games such as Go, and thus it is…

Machine Learning · Computer Science 2023-04-13 Patrick Haluptzok , Matthew Bowers , Adam Tauman Kalai

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering…

Software Engineering · Computer Science 2025-11-07 Amir Molzam Sharifloo , Maedeh Heydari , Parsa Kazerooni , Daniel Maninger , Mira Mezini

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen , Qi Guo , Hongrui Jia , Zhengran Zeng , Xin Wang , Yijiang Xu , Jian Wu , Yidong Wang , Qing Gao , Jindong Wang , Wei Ye , Shikun Zhang

We introduce WebApp1K, a practical code-generation benchmark to measure LLM ability to develop web apps. This benchmark aims to calibrate LLM output and aid the models to progressively improve code correctness and functionality. The…

Software Engineering · Computer Science 2024-08-02 Yi Cui
‹ Prev 1 2 3 10 Next ›