Related papers: DebugBench: Evaluating Debugging Capability of Lar…

Debugging with Open-Source Large Language Models: An Evaluation

Large language models have shown good potential in supporting software development tasks. This is why more and more developers turn to LLMs (e.g. ChatGPT) to support them in fixing their buggy code. While this can save time and effort, many…

Software Engineering · Computer Science 2024-09-06 Yacine Majdoub, Eya Ben Charrada

MdEval: Massively Multilingual Code Debugging

Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their…

Computation and Language · Computer Science 2025-02-25 Shukai Liu, Linzheng Chai, Jian Yang, Jiajun Shi, He Zhu, Liran Wang, Ke Jin, Wei Zhang, Hualei Zhu, Shuyue Guo, Tao Sun, Jiaheng Liu, Yunlong Duan, Yu Hao, Liqun Yang, Guanglin Niu, Ge Zhang, Zhoujun Li

Leveraging Print Debugging to Improve Code Generation in Large Language Models

Large language models (LLMs) have made significant progress in code generation tasks, but their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. To address this issue, we propose…

Computation and Language · Computer Science 2024-01-11 Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, Fei Wu

RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models

Large Language Models (LLMs) have exhibited significant proficiency in code debugging, especially in automatic program repair, which may substantially reduce the time consumption of developers and enhance their efficiency. Significant…

Software Engineering · Computer Science 2025-09-09 Jingjing Liu, Zeming Liu, Zihao Cheng, Mengliang He, Xiaoming Shi, Yuhang Guo, Xiangrong Zhu, Yuanfang Guo, Yunhong Wang, Haifeng Wang

MLDebugging: Towards Benchmarking Code Debugging Across Multi-Library Scenarios

Code debugging is a crucial task in software engineering, which attracts increasing attention. While remarkable success has been made in the era of large language models (LLMs), current research still focuses on the simple no-library or…

Software Engineering · Computer Science 2025-06-18 Jinyang Huang, Xiachong Feng, Qiguang Chen, Hanjie Zhao, Zihui Cheng, Jiesong Bai, Jingxuan Zhou, Min Li, Libo Qin

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing…

Software Engineering · Computer Science 2025-04-09 Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly…

Software Engineering · Computer Science 2023-11-10 Sungmin Kang, Juyeon Yoon, Nargiz Askarbekkyzy, Shin Yoo

Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++

Large Language Models (LLMs) such as ChatGPT-4, Claude 3, and LLaMA 4 are increasingly embedded in software/application development, supporting tasks from code generation to debugging. Yet, their real-world effectiveness in detecting…

Software Engineering · Computer Science 2026-04-28 Akshay Mhatre, Noujoud Nader, Patrick Diehl, Deepti Gupta

VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program…

Cryptography and Security · Computer Science 2024-08-22 Yu Liu, Lang Gao, Mingxin Yang, Yu Xie, Ping Chen, Xiaojin Zhang, Wei Chen

Are Large Language Models Memorizing Bug Benchmarks?

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world…

Software Engineering · Computer Science 2025-04-01 Daniel Ramos, Claudia Mamede, Kush Jain, Paulo Canelas, Catarina Gamboa, Claire Le Goues

DePro: Understanding the Role of LLMs in Debugging Competitive Programming Code

Debugging consumes a substantial portion of the software development lifecycle, yet the effectiveness of Large Language Models (LLMs) in this task is not well understood. Competitive programming offers a rich benchmark for such evaluation,…

Software Engineering · Computer Science 2026-03-23 Nabiha Parvez, Tanvin Sarkar Pallab, Mia Mohammad Imran, Tarannum Shaila Zaman

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, Zhenyu Chen

An Empirical Study on the Capability of LLMs in Decomposing Bug Reports

Background: Bug reports are essential to the software development life cycle. They help developers track and resolve issues, but are often difficult to process due to their complexity, which can delay resolution and affect software quality.…

Software Engineering · Computer Science 2025-04-30 Zhiyuan Chen, Vanessa Nava-Camal, Ahmad Suleiman, Yiming Tang, Daqing Hou, Weiyi Shang

Beyond Single Bugs: Benchmarking Large Language Models for Multi-Vulnerability Detection

Large Language Models (LLMs) have demonstrated significant potential in automated software security, particularly in vulnerability detection. However, existing benchmarks primarily focus on isolated, single-vulnerability samples or…

Cryptography and Security · Computer Science 2025-12-30 Chinmay Pushkar, Sanchit Kabra, Dhruv Kumar, Jagat Sesh Challa

What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and…

Software Engineering · Computer Science 2025-10-20 Shihan Dou, Haoxiang Jia, Shenxi Wu, Huiyuan Zheng, Muling Wu, Yunbo Tao, Ming Zhang, Mingxu Chai, Jessica Fan, Zhiheng Xi, Rui Zheng, Yueming Wu, Ming Wen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang

A Systematic Approach for Large Language Models Debugging

Large language models (LLMs) have become central to modern AI workflows, powering applications from open-ended text generation to complex agent-based reasoning. However, debugging these models remains a persistent challenge due to their…

Artificial Intelligence · Computer Science 2026-04-28 Basel Shbita, Anna Lisa Gentile, Bing Zhang, Sungeun An, Shailja Thakur, Shubhi Asthana, Yi Zhou, Saptha Surendran, Farhan Ahmed, Rohan Kulkarni, Yuya Jeremy Ong, Chad DeLuca, Hima Patel

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

Exploring the Potential and Limitations of Large Language Models for Novice Program Fault Localization

Novice programmers often face challenges in fault localization due to their limited experience and understanding of programming syntax and logic. Traditional methods like Spectrum-Based Fault Localization (SBFL) and Mutation-Based Fault…

Software Engineering · Computer Science 2025-12-04 Hexiang Xu, Hengyuan Liu, Yonghao Wu, Xiaolan Kang, Xiang Chen, Yong Liu

Holistic Evaluation of State-of-the-Art LLMs for Code Generation

This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode…

Software Engineering · Computer Science 2025-12-23 Le Zhang, Suresh Kothari

TESTEVAL: Benchmarking Large Language Models for Test Case Generation

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science 2025-02-04 Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma