Related papers: SWT-Bench: Testing and Validating Real-World Bug-F…

An Empirical Study on LLM-based Agents for Automated Bug Fixing

Large language models (LLMs) and LLM-based Agents have been applied to fix bugs automatically, demonstrating the capability in addressing software defects by engaging in development environment interaction, iterative validation and code…

Software Engineering · Computer Science 2025-10-21 Xiangxin Meng, Zexiong Ma, Pengfei Gao, Chao Peng

Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, but the value of…

Software Engineering · Computer Science 2026-04-10 Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu, David Lo, Lingxiao Jiang

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world…

Software Engineering · Computer Science 2025-06-12 Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang

Can Agents Fix Agent Issues?

LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are…

Artificial Intelligence · Computer Science 2025-10-27 Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou

Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test…

Computation and Language · Computer Science 2026-01-15 Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tianxing He

Automatic Generation of Test Cases based on Bug Reports: a Feasibility Study with Large Language Models

Software testing is a core discipline in software engineering where a large array of research results has been produced, notably in the area of automatic test generation. Because existing approaches produce test cases that either can be…

Software Engineering · Computer Science 2023-10-11 Laura Plein, Wendkûuni C. Ouédraogo, Jacques Klein, Tegawendé F. Bissyandé

SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs

Large language models (LLMs) are transforming automated program repair (APR) through agent-based approaches that localize bugs, generate patches, and verify fixes. However, the lack of high-quality, scalable training datasets, especially…

Software Engineering · Computer Science 2025-12-23 Minh V. T. Pham, Huy N. Phan, Hoang N. Phan, Cuong Le Chi, Tien N. Nguyen, Nghi D. Q. Bui

SWE-bench Goes Live!

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in…

Software Engineering · Computer Science 2025-06-03 Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang

Rethinking Verification for LLM Code Generation: From Generation to Testing

Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number…

Computation and Language · Computer Science 2025-07-11 Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen

A Self-Improving Coding Agent

Recent advancements in Large Language Models (LLMs) have spurred interest in deploying LLM agents to undertake tasks in the world. LLMs are often deployed in agent systems: code that orchestrates LLM calls and provides them with tools. We…

Artificial Intelligence · Computer Science 2025-05-20 Maxime Robeyns, Martin Szummer, Laurence Aitchison

SWE-Tester: Training Open-Source LLMs for Issue Reproduction in Real-World Repositories

Software testing is crucial for ensuring the correctness and reliability of software systems. Automated generation of issue reproduction tests from natural language issue descriptions enhances developer productivity by simplifying root…

Software Engineering · Computer Science 2026-01-21 Aditya Bharat Soni, Rajat Ghosh, Vaishnavi Bhargava, Valerie Chen, Debojyoti Dutta

Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis

Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle.…

Software Engineering · Computer Science 2026-03-16 Greta Dolcetti, Vincenzo Arceri, Eleonora Iotti, Sergio Maffeis, Agostino Cortesi, Enea Zaffanella

MarsCode Agent: AI-native Automated Bug Fixing

Recent advances in large language models (LLMs) have shown significant potential to automate various software development tasks, including code completion, test generation, and bug fixing. However, the application of LLMs for automated bug…

Software Engineering · Computer Science 2024-09-05 Yizhou Liu, Pengfei Gao, Xinchen Wang, Jie Liu, Yexuan Shi, Zhao Zhang, Chao Peng

RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. Existing benchmarks often fall short by relying on synthetic…

Cryptography and Security · Computer Science 2026-02-02 Yanlin Wang, Ziyao Zhang, Chong Wang, Xinyi Xu, Mingwei Liu, Yong Wang, Jiachi Chen, Zibin Zheng

OSS-Bench: Benchmark Generator for Coding LLMs

In light of the rapid adoption of AI coding assistants, LLM-assisted development has become increasingly prevalent, creating an urgent need for robust evaluation of generated code quality. Existing benchmarks often require extensive manual…

Software Engineering · Computer Science 2025-05-21 Yuancheng Jiang, Roland Yap, Zhenkai Liang

AI-powered Code Review with LLMs: Early Results

In this paper, we present a novel approach to improving software quality and efficiency through a Large Language Model (LLM)-based model designed to review code and identify potential issues. Our proposed LLM-based AI agent model is trained…

Software Engineering · Computer Science 2025-12-11 Zeeshan Rasheed, Malik Abdul Sami, Muhammad Waseem, Kai-Kristian Kemell, Xiaofeng Wang, Anh Nguyen, Kari Systä, Pekka Abrahamsson

Agentless: Demystifying LLM-based Software Engineering Agents

Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry…

Software Engineering · Computer Science 2024-10-30 Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, Lingming Zhang

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code

The code generation capabilities of large language models (LLMs) have emerged as a critical dimension in evaluating their overall performance. However, prior research has largely overlooked the security risks inherent in the generated code.…

Cryptography and Security · Computer Science 2025-06-23 Xinghang Li, Jingzhe Ding, Chao Peng, Bing Zhao, Xiang Gao, Hongwan Gao, Xinchen Gu

SWE-Adept: An LLM-Based Agentic Framework for Deep Codebase Analysis and Structured Issue Resolution

Large language models (LLMs) exhibit strong performance on self-contained programming tasks. However, they still struggle with repository-level software engineering (SWE), which demands (1) deep codebase navigation with effective context…

Software Engineering · Computer Science 2026-05-27 Kang He, Kaushik Roy