Related papers: Breaking, Stale, or Missing? Benchmarking Coding A…

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there…

Software Engineering · Computer Science 2025-03-20 Kush Jain , Gabriel Synnaeve , Baptiste Rozière

FeatBench: Towards More Realistic Evaluation of Feature-level Code Generation

Evaluating Large Language Models (LLMs) on repository-level feature implementation is a critical frontier in software engineering. However, establishing a benchmark that faithfully mirrors realistic development scenarios remains a…

Computation and Language · Computer Science 2026-02-19 Haorui Chen , Chengze Li , Jia Li

FeatureBench: Benchmarking Agentic Coding for Complex Feature Development

Agents powered by large language models (LLMs) are increasingly adopted in the software industry, contributing code as collaborators or even autonomous developers. As their presence grows, it becomes important to assess the current…

Software Engineering · Computer Science 2026-02-12 Qixing Zhou , Jiacheng Zhang , Haiyang Wang , Rui Hao , Jiahe Wang , Minghao Han , Yuxue Yang , Shuzhe Wu , Feiyang Pan , Lue Fan , Dandan Tu , Zhaoxiang Zhang

Converted, Not Equivalent: Benchmarking Codebase Conversion via Observational Equivalence

Coding agents increasingly act as codebase-scale collaborators that can assist with codebase conversion, but this progress has exposed a critical weakness: agents often over-trust their own local validation routines and declare success on…

Software Engineering · Computer Science 2026-05-29 Linxin Song , Jiefeng Chen , Yue Huang , Bhavana Dalvi Mishra , Chi Wang , Jieyu Zhao , Jinsung Yoon , Tomas Pfister

Automated Benchmark Generation for Repository-Level Coding Tasks

Code Agent development is an extremely active research area, where a reliable performance metric is critical for tracking progress and guiding new developments. This demand is underscored by the meteoric rise in popularity of SWE-Bench.…

Software Engineering · Computer Science 2025-03-12 Konstantinos Vergopoulos , Mark Niklas Müller , Martin Vechev

BenchBench: Benchmarking Automated Benchmark Generation

Benchmarks are the de facto standard for tracking progress in large language models (LLMs), yet static test sets can rapidly saturate, become vulnerable to contamination, and are costly to refresh. Scalable evaluation of open-ended items…

Computation and Language · Computer Science 2026-03-24 Yandan Zheng , Haoran Luo , Zhenghong Lin , Wenjin Liu , Luu Anh Tuan

PerfBench: Can Agents Resolve Real-World Performance Bugs?

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown…

Software Engineering · Computer Science 2025-12-04 Spandan Garg , Roshanak Zilouchian Moghaddam , Neel Sundaresan

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm…

Software Engineering · Computer Science 2026-03-27 Fanheng Kong , Jingyuan Zhang , Yang Yue , Chenxi Sun , Yang Tian , Shi Feng , Xiaocui Yang , Daling Wang , Yu Tian , Jun Du , Wenchong Zeng , Han Li , Kun Gai

Vibe Code Bench: Evaluating AI Models on End-to-End Web Application Development

Code generation has emerged as one of AI's highest-impact use cases, yet existing benchmarks measure isolated tasks rather than the complete "zero-to-one" process of building a working application from scratch. We introduce Vibe Code Bench,…

Software Engineering · Computer Science 2026-05-15 Hung Tran , Langston Nashold , Rayan Krishnan , Antoine Bigeard , Alex Gu

Assessing requirements engineering and software test alignment -- Five case studies

The development of large, software-intensive systems is a complex undertaking that we generally tackle by a divide and conquer strategy. Companies thereby face the challenge of coordinating individual aspects of software development, in…

Software Engineering · Computer Science 2023-08-16 Michael Unterkalmsteiner , Tony Gorschek , Robert Feldt , Eriks Klotins

CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection

Identifying vulnerabilities in source code is crucial, especially in critical software components. Existing methods such as static analysis, dynamic analysis, formal verification, and recently Large Language Models are widely used to detect…

Cryptography and Security · Computer Science 2025-04-01 Richard A. Dubniczky , Krisztofer Zoltán Horvát , Tamás Bisztray , Mohamed Amine Ferrag , Lucas C. Cordeiro , Norbert Tihanyi

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science 2025-02-04 Wenhan Wang , Chenyuan Yang , Zhijie Wang , Yuheng Huang , Zhaoyang Chu , Da Song , Lingming Zhang , An Ran Chen , Lei Ma

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu

Are Benchmark Tests Strong Enough? Mutation-Guided Diagnosis and Augmentation of Regression Suites

Benchmarks driven by test suites, notably SWE-bench, have become the de facto standard for measuring the effectiveness of automated issue-resolution agents: a generated patch is accepted whenever it passes the accompanying regression tests.…

Software Engineering · Computer Science 2026-04-03 Chenglin Li , Yisen Xu , Zehao Wang , Shin Hwei Tan , Tse-Hsun Chen

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

AL-Bench: A Benchmark for Automatic Logging

Logging, the practice of inserting log statements into source code, is critical for improving software reliability. Recently, language model-based techniques have been developed to automate log statement generation based on input code.…

Software Engineering · Computer Science 2025-04-03 Boyin Tan , Junjielong Xu , Zhouruixing Zhu , Pinjia He

EvoCodeBench: An Evolving Code Generation Benchmark with Domain-Specific Evaluations

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks have two limitations - data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the…

Computation and Language · Computer Science 2024-10-31 Jia Li , Ge Li , Xuanming Zhang , Yunfei Zhao , Yihong Dong , Zhi Jin , Binhua Li , Fei Huang , Yongbin Li

ScarfBench: A Benchmark for Cross-Framework Application Migration in Enterprise Java

Java remains central to enterprise software, and many applications outlive their original architecture. Migrating them across frameworks is a behavior-preserving refactoring spanning build configuration, dependency injection, persistence,…

Software Engineering · Computer Science 2026-05-19 Advait Pavuluri , Bridget McGinn , Ashita Saxena , George Safta , Srikanth Tamilselvam , Raju Pavuluri , Michele Merler , Baishakhi Ray , Rahul Krishna

ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

Autonomous agents are increasingly expected to support scientific research, and recent benchmarks report progress in code repair and autonomous experimentation. However, these evaluations typically assume a pre-configured execution…

Software Engineering · Computer Science 2026-03-12 Yubang Wang , Chenxi Zhang , Bowen Chen , Zezheng Huai , Zihao Dai , Xinchi Chen , Yuxin Wang , Yining Zheng , Jingjing Gong , Xipeng Qiu

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce…

Artificial Intelligence · Computer Science 2025-03-12 Dhruv Gautam , Spandan Garg , Jinu Jang , Neel Sundaresan , Roshanak Zilouchian Moghaddam