Related papers: CodeBenchGen: Creating Scalable Execution-based Co…

CodeSpecBench: Benchmarking LLMs for Executable Behavioral Specification Generation

Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions,…

Software Engineering · Computer Science 2026-04-15 Zaoyu Chen, Jianbo Dai, Boyu Zhu, Jingdong Wang, Huiming Wang, Xin Xu, Haoyang Yuan, Zhijiang Guo, Xiao-Ming Wu

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

AixBench: A Code Generation Benchmark Dataset

We present a benchmark dataset for evaluating the method-level code generation task. The benchmark contains a dataset of 175 samples for automated evaluation and a dataset of 161 samples for manual evaluation. We also present a new metric for…

Software Engineering · Computer Science 2022-07-22 Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, He Wei

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there…

Software Engineering · Computer Science 2025-03-20 Kush Jain, Gabriel Synnaeve, Baptiste Rozière

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

ExecRepoBench: Multi-level Executable Code Completion Evaluation

Code completion has become an essential tool for daily software development. Existing evaluation benchmarks often employ static methods that do not fully capture the dynamic nature of real-world coding environments and face significant…

Computation and Language · Computer Science 2024-12-17 Jian Yang, Jiajun Zhang, Jiaxi Yang, Ke Jin, Lei Zhang, Qiyao Peng, Ken Deng, Yibo Miao, Tianyu Liu, Zeyu Cui, Binyuan Hui, Junyang Lin

EVM-QuestBench: An Execution-Grounded Benchmark for Natural-Language Transaction Code Generation

Large language models are increasingly applied to various development scenarios. However, in on-chain transaction scenarios, even a minor error can cause irreversible loss for users. Existing evaluations often overlook execution accuracy…

Computation and Language · Computer Science 2026-04-08 Pei Yang, Wanyi Chen, Ke Wang, Lynn Ai, Eric Yang, Tianyu Shi

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

Modern software development demands code that is maintainable, testable, and scalable by organizing the implementation into modular components with iterative reuse of existing codes. We formalize this iterative, multi-turn paradigm as…

Software Engineering · Computer Science 2026-04-16 Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, Wentao Zhang

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging…

Software Engineering · Computer Science 2026-03-17 Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Prompts

Large language models (LLMs) have manifested strong ability to generate codes for productive activities. However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory…

Computation and Language · Computer Science 2024-05-08 Shudan Zhang, Hanlin Zhao, Xiao Liu, Qinkai Zheng, Zehan Qi, Xiaotao Gu, Xiaohan Zhang, Yuxiao Dong, Jie Tang

Execution-based Evaluation for Data Science Code Generation Models

Code generation models can benefit data scientists' productivity by automatically generating code from context and text descriptions. An important measure of the modeling progress is whether a model can generate code that can correctly…

Software Engineering · Computer Science 2022-11-18 Junjie Huang, Chenglong Wang, Jipeng Zhang, Cong Yan, Haotian Cui, Jeevana Priya Inala, Colin Clement, Nan Duan, Jianfeng Gao

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code.…

Cryptography and Security · Computer Science 2025-06-23 Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by…

Computation and Language · Computer Science 2024-10-17 Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Large Language Models (LLMs) have demonstrated remarkable performance on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities…

Computation and Language · Computer Science 2024-06-10 Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, Shuiguang Deng

A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks

Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges are primarily due to the…

Software Engineering · Computer Science 2024-11-26 Rohit Dandamudi, Gema Rodríguez-Pérez