Related papers: Reproduction Test Generation for Java SWE Issues

Heterogeneous Prompting and Execution Feedback for SWE Issue Test Generation and Selection

A software engineering issue (SWE issue) is easier to resolve when accompanied by a reproduction test. Unfortunately, most issues do not come with functioning reproduction tests, so this paper explores how to generate them automatically.…

Software Engineering · Computer Science · 2026-01-26 · Toufique Ahmed, Jatin Ganhotra, Avraham Shinnar, Martin Hirzel

Otter: Generating Tests from Issues to Validate SWE Patches

While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. This paper focuses on the scenario…

Software Engineering · Computer Science · 2025-06-02 · Toufique Ahmed, Jatin Ganhotra, Rangeet Pan, Avraham Shinnar, Saurabh Sinha, Martin Hirzel

SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories

Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root…

Software Engineering · Computer Science · 2026-01-21 · Aditya Bharat Soni, Rajat Ghosh, Vaishnavi Bhargava, Valerie Chen, Debojyoti Dutta

GitBug-Java: A Reproducible Benchmark of Recent Java Bugs

Bug-fix benchmarks are essential for evaluating methodologies in automatic program repair (APR) and fault localization (FL). However, existing benchmarks, exemplified by Defects4J, need to evolve to incorporate recent bug-fixes aligned with…

Software Engineering · Computer Science · 2024-11-04 · André Silva, Nuno Saavedra, Martin Monperrus

Echo: Graph-Enhanced Retrieval and Execution Feedback for Issue Reproduction Test Generation

Identifying the root cause of a bug remains difficult for many developers because bug reports often lack a bug reproducing test case that reliably triggers the failure. Manually writing such test cases is time-consuming and requires…

Software Engineering · Computer Science · 2026-03-10 · Zhiwei Fei, Yue Pan, Federica Sarro, Jidong Ge, Marc Liu, Vincent Ng, He Ye

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large…

Software Engineering · Computer Science · 2024-08-27 · Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang

Automating Test Case Identification in Java Open Source Projects on GitHub

Software testing is one of the very important Quality Assurance (QA) components. A lot of researchers deal with the testing process in terms of tester motivation and how tests should or should not be written. However, it is not known from…

Software Engineering · Computer Science · 2022-01-04 · Matej Madeja, Jaroslav Porubän, Michaela Bačíková, Matúš Sulír, Ján Juhár, Sergej Chodarev, Filip Gurbáľ

RTj: a Java framework for detecting and refactoring rotten green test cases

Rotten green tests are passing tests that have at least one assertion not executed. They give developers false confidence. In this paper, we present RTj, a framework that analyzes test cases from Java projects with the goal of…

Software Engineering · Computer Science · 2019-12-17 · Matias Martinez, Anne Etien, Stéphane Ducasse, Christopher Fuhrman

Can Old Tests Do New Tricks for Resolving SWE Issues?

Test suites in real-world projects are often large and achieve high code coverage, yet they remain insufficient for detecting all bugs. The abundance of unresolved issues in open-source project trackers highlights this gap. While regression…

Software Engineering · Computer Science · 2026-05-12 · Yang Chen, Toufique Ahmed, Reyhaneh Jabbarvand, Martin Hirzel

The Java Build Framework: Large Scale Compilation

Large repositories of source code for research tend to limit their utility to static analysis of the code, as they give no guarantees on whether the projects are compilable, much less runnable in any way. The immediate consequence of the…

Software Engineering · Computer Science · 2018-04-13 · Pedro Martins, Rohan Achar, Cristina V. Lopes

TestGenEval: A Real World Unit Test Generation and Test Completion Benchmark

Code generation models can help improve many common software tasks ranging from code completion to defect prediction. Most of the existing benchmarks for code generation LLMs focus on code authoring or code completion. Surprisingly, there…

Software Engineering · Computer Science · 2025-03-20 · Kush Jain, Gabriel Synnaeve, Baptiste Rozière

TDD-Bench Verified: Can LLMs Generate Tests for Issues Before They Get Resolved?

Test-driven development (TDD) is the practice of writing tests first and coding later, and the proponents of TDD expound its numerous benefits. For instance, given an issue on a source code repository, tests can clarify the desired behavior…

Software Engineering · Computer Science · 2024-12-05 · Toufique Ahmed, Martin Hirzel, Rangeet Pan, Avraham Shinnar, Saurabh Sinha

The Reproducibility of Programming-Related Issues in Stack Overflow Questions

Software developers often look for solutions to their code-level problems using the Stack Overflow Q&A website. To receive help, developers frequently submit questions containing sample code segments and the description of the programming…

Software Engineering · Computer Science · 2021-12-28 · Saikat Mondal, Mohammad Masudur Rahman, Chanchal K. Roy, Kevin Schneider

Reproducible Automated Program Repair Is Hard -- Experiences With the Defects4J Dataset

In the research of automated program repair (APR), benchmark datasets consisting of known defects in combination with test suites that indicate the defects are of high importance. They allow for an evidence-based comparison of different APR…

Software Engineering · Computer Science · 2026-04-30 · Adam Krafczyk, Klaus Schmid

EvoSpex: An Evolutionary Algorithm for Learning Postconditions

Software reliability is a primary concern in the construction of software, and thus a fundamental component in the definition of software quality. Analyzing software reliability requires a specification of the intended behavior of the…

Software Engineering · Computer Science · 2021-03-02 · Facundo Molina, Pablo Ponzio, Nazareno Aguirre, Marcelo Frias

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on…

Software Engineering · Computer Science · 2025-12-22 · Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, Gabriel Maduekwe

Bears: An Extensible Java Bug Benchmark for Automatic Program Repair Studies

Benchmarks of bugs are essential to empirically evaluate automatic program repair tools. In this paper, we present Bears, a project for collecting and storing bugs into an extensible bug benchmark for automatic repair studies in Java. The…

Software Engineering · Computer Science · 2019-04-04 · Fernanda Madeiral, Simon Urli, Marcelo Maia, Martin Monperrus

BUMP: A Benchmark of Reproducible Breaking Dependency Updates

Third-party dependency updates can cause a build to fail if the new dependency version introduces a change that is incompatible with the usage: this is called a breaking dependency update. Research on breaking dependency updates is active,…

Software Engineering · Computer Science · 2024-03-21 · Frank Reyes, Yogya Gamage, Gabriel Skoglund, Benoit Baudry, Martin Monperrus

Automated Test Generation from Program Documentation Encoded in Code Comments

Documenting the functionality of software units with code comments, e.g., Javadoc comments, is a common programmer best-practice in software engineering. This paper introduces a novel test generation technique that exploits the code-comment…

Software Engineering · Computer Science · 2025-05-01 · Giovanni Denaro, Luca Guglielmo

1-2-3 Reproducibility for Quantum Software Experiments

Various fields of science face a reproducibility crisis. For quantum software engineering as an emerging field, it is therefore imperative to focus on proper reproducibility engineering from the start. Yet the provision of reproduction…

Software Engineering · Computer Science · 2022-01-31 · Wolfgang Mauerer, Stefanie Scherzinger