Related papers: Tests4Py: A Benchmark for System Testing

BugsInPy: A Database of Existing Bugs in Python Programs to Enable Controlled Testing and Debugging Studies

The 2019 edition of Stack Overflow developer survey highlights that, for the first time, Python outperformed Java in terms of popularity. The gap between Python and Java further widened in the 2020 edition of the survey. Unfortunately,…

Software Engineering · Computer Science 2024-01-30 Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, Brian Goh, Ferdian Thung, Hong Jin Kang, Thong Hoang, David Lo, Eng Lieh Ouh

Bugs4Q: A Benchmark of Real Bugs for Quantum Programs

Realistic benchmarks of reproducible bugs and fixes are vital to good experimental evaluation of debugging and testing approaches. However, there is no suitable benchmark suite that can systematically evaluate the debugging and testing…

Software Engineering · Computer Science 2021-09-22 Pengzhan Zhao, Jianjun Zhao, Zhongtao Miao, Shuhan Lan

Bugs in Machine Learning-based Systems: A Faultload Benchmark

The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and…

Software Engineering · Computer Science 2023-01-18 Mohammad Mehdi Morovati, Amin Nikanjam, Foutse Khomh, Zhen Ming Jiang

Mining Bug Repositories for Multi-Fault Programs

Datasets such as Defects4J and BugsInPy that contain bugs from real-world software projects are necessary for a realistic evaluation of automated debugging tools. However these datasets largely identify only a single bug in each entry,…

Software Engineering · Computer Science 2024-04-11 Dylan Callaghan, Bernd Fischer

Critical Review of BugSwarm for Fault Localization and Program Repair

Benchmarks play an important role in evaluating the efficiency and effectiveness of solutions to automate several phases of the software development lifecycle. Moreover, if well designed, they also serve us well as an important artifact to…

Software Engineering · Computer Science 2019-05-24 Thomas Durieux, Rui Abreu

PerfBench: Can Agents Resolve Real-World Performance Bugs?

Performance bugs are inefficiencies in software that waste computational resources without causing functional failures, making them particularly challenging to detect and fix. While recent advances in Software Engineering agents have shown…

Software Engineering · Computer Science 2025-12-04 Spandan Garg, Roshanak Zilouchian Moghaddam, Neel Sundaresan

Understanding Bug-Reproducing Tests: A First Empirical Study

Developers create bug-reproducing tests that support debugging by failing as long as the bug is present, and passing once the bug has been fixed. These tests are usually integrated into existing test suites and executed regularly alongside…

Software Engineering · Computer Science 2026-02-04 Andre Hora, Gordon Fraser

From Bugs to Benchmarks: A Comprehensive Survey of Software Defect Datasets

Software defect datasets, which are collections of software bugs, are essential resources to facilitate empirical research and enable standardized benchmarking for a wide range of software engineering techniques, including emerging areas…

Software Engineering · Computer Science 2026-02-12 Hao-Nan Zhu, Robert M. Furth, Michael Pradel, Cindy Rubio-González

Empirical Review of Java Program Repair Tools: A Large-Scale Experiment on 2,141 Bugs and 23,551 Repair Attempts

In the past decade, research on test-suite-based automatic program repair has grown significantly. Each year, new approaches and implementations are featured in major software engineering venues. However, most of those approaches are…

Software Engineering · Computer Science 2019-05-29 Thomas Durieux, Fernanda Madeiral, Matias Martinez, Rui Abreu

GitBug-Java: A Reproducible Benchmark of Recent Java Bugs

Bug-fix benchmarks are essential for evaluating methodologies in automatic program repair (APR) and fault localization (FL). However, existing benchmarks, exemplified by Defects4J, need to evolve to incorporate recent bug-fixes aligned with…

Software Engineering · Computer Science 2024-11-04 André Silva, Nuno Saavedra, Martin Monperrus

An Empirical Evaluation of Locally Deployed LLMs for Bug Detection in Python Code

Large language models (LLMs) have demonstrated strong performance on a wide range of software engineering tasks, including code generation and analysis. However, most prior work relies on cloud-based models or specialized hardware, limiting…

Software Engineering · Computer Science 2026-04-28 Jelena Ilić Vulićević

QBugs: A Collection of Reproducible Bugs in Quantum Algorithms and a Supporting Infrastructure to Enable Controlled Quantum Software Testing and Debugging Experiments

Reproducibility and comparability of empirical results are at the core tenet of the scientific method in any scientific field. To ease reproducibility of empirical studies, several benchmarks in software engineering research, such as…

Software Engineering · Computer Science 2021-04-01 José Campos, André Souto

BugScope: Learn to Find Bugs Like Human

Software auditing is an increasingly critical task in the era of rapid code generation. While LLM-based auditors have demonstrated strong potential, their effectiveness remains limited by misalignment with the highly complex,…

Software Engineering · Computer Science 2026-04-16 Jinyao Guo, Chengpeng Wang, Dominic Deluca, Jinjie Liu, Zhuo Zhang, Xiangyu Zhang

DebugBench: Evaluating Debugging Capability of Large Language Models

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs'…

Software Engineering · Computer Science 2024-06-07 Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, Maosong Sun

Categorizing Bugs with Social Networks: A Case Study on Four Open Source Software Communities

Efficient bug triaging procedures are an important precondition for successful collaborative software engineering projects. Triaging bugs can become a laborious task particularly in open source software (OSS) projects with a large base of…

Software Engineering · Computer Science 2013-03-04 Marcelo Serrano Zanetti, Ingo Scholtes, Claudio Juan Tessone, Frank Schweitzer

Real Faults in Deep Learning Fault Benchmarks: How Real Are They?

As the adoption of Deep Learning (DL) systems continues to rise, an increasing number of approaches are being proposed to test these systems, localise faults within them, and repair those faults. The best attestation of effectiveness for…

Software Engineering · Computer Science 2024-12-24 Gunel Jahangirova, Nargiz Humbatova, Jinhan Kim, Shin Yoo, Paolo Tonella

Towards Automated Performance Bug Identification in Python

Context: Software performance is a critical non-functional requirement, appearing in many fields such as mission critical applications, financial, and real time systems. In this work we focused on early detection of performance bugs; our…

Software Engineering · Computer Science 2017-02-28 Sokratis Tsakiltsidis, Andriy Miranskyy, Elie Mazzawi

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science 2025-02-04 Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma

Towards a Benchmark Set for Program Repair Based on Partial Fixes

Software bugs significantly contribute to software cost and increase the risk of system malfunctioning. In recent years, many automated program-repair approaches have been proposed to automatically fix undesired program behavior. Despite of…

Software Engineering · Computer Science 2021-07-19 Dirk Beyer, Lars Grunske, Thomas Lemberger, Minxing Tang

Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

Software is used in critical applications in our day-to-day life and it is important to ensure its correctness. One popular approach to assess correctness is to evaluate software on tests. If a test fails, it indicates a fault in the…

Software Engineering · Computer Science 2025-04-01 Max Hort, Leon Moonen