Related papers: Precise Debugging Benchmark: Is Your Model Debuggi…

Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs.…

Software Engineering · Computer Science 2024-06-12 Li Zhong , Zilong Wang , Jingbo Shang

Real Faults in Deep Learning Fault Benchmarks: How Real Are They?

As the adoption of Deep Learning (DL) systems continues to rise, an increasing number of approaches are being proposed to test these systems, localise faults within them, and repair those faults. The best attestation of effectiveness for…

Software Engineering · Computer Science 2024-12-24 Gunel Jahangirova , Nargiz Humbatova , Jinhan Kim , Shin Yoo , Paolo Tonella

Why Stop at One Error? Benchmarking LLMs as Data Science Code Debuggers for Multi-Hop and Multi-Bug Errors

LLMs are transforming software development, yet current code generation and code repair benchmarks mainly assess syntactic and functional correctness in simple, single-error cases. LLMs' capabilities to autonomously find and fix runtime…

Computation and Language · Computer Science 2025-09-17 Zhiyu Yang , Shuo Wang , Yukun Yan , Yang Deng

CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging

Debugging CUDA programs has long been challenging because failures often arise from subtle interactions among hardware behavior, compiler decisions, memory hierarchy, and asynchronous execution. More importantly, with the rapid expansion of…

Machine Learning · Computer Science 2026-05-27 Shiyang Li , Haoyang Chen , Mattia Fazzini , Caiwen Ding

DePro: Understanding the Role of LLMs in Debugging Competitive Programming Code

Debugging consumes a substantial portion of the software development lifecycle, yet the effectiveness of Large Language Models (LLMs) in this task is not well understood. Competitive programming offers a rich benchmark for such evaluation,…

Software Engineering · Computer Science 2026-03-23 Nabiha Parvez , Tanvin Sarkar Pallab , Mia Mohammad Imran , Tarannum Shaila Zaman

DebugBench: Evaluating Debugging Capability of Large Language Models

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs'…

Software Engineering · Computer Science 2024-06-07 Runchu Tian , Yining Ye , Yujia Qin , Xin Cong , Yankai Lin , Yinxu Pan , Yesai Wu , Haotian Hui , Weichuan Liu , Zhiyuan Liu , Maosong Sun

Teaching Large Language Models to Self-Debug

Large language models (LLMs) have achieved impressive performance on code generation. However, for complex programming tasks, generating the correct solution in one go becomes challenging, thus some prior works have designed program repair…

Computation and Language · Computer Science 2023-10-06 Xinyun Chen , Maxwell Lin , Nathanael Schärli , Denny Zhou

Revisit Self-Debugging with Self-Generated Tests for Code Generation

Large language models (LLMs) have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of…

Software Engineering · Computer Science 2025-01-23 Xiancai Chen , Zhengwei Tao , Kechi Zhang , Changzhi Zhou , Wanli Gu , Yuanpeng He , Mengdi Zhang , Xunliang Cai , Haiyan Zhao , Zhi Jin

Bugs in Machine Learning-based Systems: A Faultload Benchmark

The rapid escalation of applying Machine Learning (ML) in various domains has led to paying more attention to the quality of ML components. There is then a growth of techniques and tools aiming at improving the quality of ML components and…

Software Engineering · Computer Science 2023-01-18 Mohammad Mehdi Morovati , Amin Nikanjam , Foutse Khomh , Zhen Ming (Jack) Jiang

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing , Wenbin Hu , Haochen Shi , Hanyu Yang , Sirui Zhang , Shaojin Chen , Haoran Li , Yangqiu Song

DebugLM: Learning Traceable Training Data Provenance for LLMs

Large language models (LLMs) are trained through multi-stage pipelines over heterogeneous data sources, yet developers lack a principled way to pinpoint the specific data responsible for an observed behavior. This lack of observability…

Computation and Language · Computer Science 2026-03-19 Wenjie Jacky Mo , Qin Liu , Xiaofei Wen , Wenxuan Zhou , Zhe Zhao , Muhao Chen

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu

Are Large Language Models Memorizing Bug Benchmarks?

Large Language Models (LLMs) have become integral to various software engineering tasks, including code generation, bug detection, and repair. To evaluate model performance in these domains, numerous bug benchmarks containing real-world…

Software Engineering · Computer Science 2025-04-01 Daniel Ramos , Claudia Mamede , Kush Jain , Paulo Canelas , Catarina Gamboa , Claire Le Goues

LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code

Large Language Models (LLMs) have demonstrated remarkable performance in code completion. However, the training data used to develop these models often contain a significant amount of buggy code. Yet, it remains unclear to what extent these…

Software Engineering · Computer Science 2025-03-17 Liwei Guo , Sixiang Ye , Zeyu Sun , Xiang Chen , Yuxia Zhang , Bo Wang , Jie M. Zhang , Zheng Li , Yong Liu

Learning to Generate Unit Tests for Automated Debugging

Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test…

Software Engineering · Computer Science 2025-08-22 Archiki Prasad , Elias Stengel-Eskin , Justin Chih-Yao Chen , Zaid Khan , Mohit Bansal

Do AI models help produce verified bug fixes?

Among areas of software engineering where AI techniques -- particularly, Large Language Models -- seem poised to yield dramatic improvements, an attractive candidate is Automatic Program Repair (APR), the production of satisfactory…

Software Engineering · Computer Science 2025-08-05 Li Huang , Ilgiz Mustafin , Marco Piccioni , Alessandro Schena , Reto Weber , Bertrand Meyer

Effective Large Language Model Debugging with Best-first Tree Search

Large Language Models (LLMs) show promise in code generation tasks. However, their code-writing abilities are often limited in scope: while they can successfully implement simple functions, they struggle with more complex tasks. A…

Software Engineering · Computer Science 2024-07-30 Jialin Song , Jonathan Raiman , Bryan Catanzaro

LeDex: Training LLMs to Better Self-Debug and Explain Code

In the domain of code generation, self-debugging is crucial. It allows LLMs to refine their generated code based on execution feedback. This is particularly important because generating correct solutions in one attempt proves challenging…

Computation and Language · Computer Science 2025-02-17 Nan Jiang , Xiaopeng Li , Shiqi Wang , Qiang Zhou , Soneya Binta Hossain , Baishakhi Ray , Varun Kumar , Xiaofei Ma , Anoop Deoras

From Code to Correctness: Closing the Last Mile of Code Generation with Hierarchical Debugging

While large language models have made significant strides in code generation, the pass rate of the generated code is bottlenecked on subtle errors, often requiring human intervention to pass tests, especially for complex problems. Existing…

Computation and Language · Computer Science 2025-11-25 Yuling Shi , Songsong Wang , Chengcheng Wan , Min Wang , Xiaodong Gu

VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program…

Cryptography and Security · Computer Science 2024-08-22 Yu Liu , Lang Gao , Mingxin Yang , Yu Xie , Ping Chen , Xiaojin Zhang , Wei Chen