Related papers: Enabling BLV Developers with LLM-driven Code Debug…

Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step

Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs.…

Software Engineering · Computer Science 2024-06-12 Li Zhong , Zilong Wang , Jingbo Shang

DebugBench: Evaluating Debugging Capability of Large Language Models

Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs'…

Software Engineering · Computer Science 2024-06-07 Runchu Tian , Yining Ye , Yujia Qin , Xin Cong , Yankai Lin , Yinxu Pan , Yesai Wu , Haotian Hui , Weichuan Liu , Zhiyuan Liu , Maosong Sun

ChatDBG: Augmenting Debugging with Large Language Models

Debugging is a critical but challenging task for programmers. This paper proposes ChatDBG, an AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of…

Software Engineering · Computer Science 2025-06-23 Kyla H. Levin , Nicolas van Kempen , Emery D. Berger , Stephen N. Freund

Debugging with Open-Source Large Language Models: An Evaluation

Large language models have shown good potential in supporting software development tasks. This is why more and more developers turn to LLMs (e.g. ChatGPT) to support them in fixing their buggy code. While this can save time and effort, many…

Software Engineering · Computer Science 2024-09-06 Yacine Majdoub , Eya Ben Charrada

Benchmarking Large Language Models for Automated Verilog RTL Code Generation

Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating…

Programming Languages · Computer Science 2022-12-22 Shailja Thakur , Baleegh Ahmad , Zhenxing Fan , Hammond Pearce , Benjamin Tan , Ramesh Karri , Brendan Dolan-Gavitt , Siddharth Garg

An Empirical Study on the Capability of LLMs in Decomposing Bug Reports

Background: Bug reports are essential to the software development life cycle. They help developers track and resolve issues, but are often difficult to process due to their complexity, which can delay resolution and affect software quality.…

Software Engineering · Computer Science 2025-04-30 Zhiyuan Chen , Vanessa Nava-Camal , Ahmad Suleiman , Yiming Tang , Daqing Hou , Weiyi Shang

MdEval: Massively Multilingual Code Debugging

Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their…

Computation and Language · Computer Science 2025-02-25 Shukai Liu , Linzheng Chai , Jian Yang , Jiajun Shi , He Zhu , Liran Wang , Ke Jin , Wei Zhang , Hualei Zhu , Shuyue Guo , Tao Sun , Jiaheng Liu , Yunlong Duan , Yu Hao , Liqun Yang , Guanglin Niu , Ge Zhang , Zhoujun Li

SPROUT: an Interactive Authoring Tool for Generating Programming Tutorials with the Visualization of Large Language Models

The rapid development of large language models (LLMs), such as ChatGPT, has revolutionized the efficiency of creating programming tutorials. LLMs can be instructed with text prompts to generate comprehensive text descriptions of code…

Human-Computer Interaction · Computer Science 2024-10-29 Yihan Liu , Zhen Wen , Luoxuan Weng , Ollie Woodman , Yi Yang , Wei Chen

SimulBench: Evaluating Language Models with Creative Simulation Tasks

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation…

Computation and Language · Computer Science 2024-09-13 Qi Jia , Xiang Yue , Tianyu Zheng , Jie Huang , Bill Yuchen Lin

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu

ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A…

Software Engineering · Computer Science 2025-06-27 Shirley Kokane , Ming Zhu , Tulika Awalgaonkar , Jianguo Zhang , Thai Hoang , Akshara Prabhakar , Zuxin Liu , Tian Lan , Liangwei Yang , Juntao Tan , Rithesh Murthy , Weiran Yao , Zhiwei Liu , Juan Carlos Niebles , Huan Wang , Shelby Heinecke , Caiming Xiong , Silvio Savarese

In-IDE Toolkit for Developers of AI-Based Features

AI-enabled features built on LLMs and agentic workflows are difficult to test, debug, and reproduce, especially for product-focused software engineers without a machine learning background. We present the AI Toolkit plugin for JetBrains…

Software Engineering · Computer Science 2026-05-15 Yaroslav Sokolov , Yury Khudyakov , Lenar Sharipov , Andrei Gasparian , Parth Tiwary , Artem Trofimov

Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs

LLM-based assistants, such as GitHub Copilot and ChatGPT, have the potential to generate code that fulfills a programming task described in a natural language description, referred to as a prompt. The widespread accessibility of these…

Software Engineering · Computer Science 2024-05-24 Sylvain Kouemo Ngassom , Arghavan Moradi Dakhel , Florian Tambon , Foutse Khomh

Can We Enhance Bug Report Quality Using LLMs?: An Empirical Study of LLM-Based Bug Report Generation

Bug reports contain the information developers need to triage and fix software bugs. However, unclear, incomplete, or ambiguous information may lead to delays and excessive manual effort spent on bug triage and resolution. In this paper, we…

Software Engineering · Computer Science 2025-04-29 Jagrit Acharya , Gouri Ginde

Towards a Neural Debugger for Python

Training large language models (LLMs) on Python execution traces grounds them in code execution and enables the line-by-line execution prediction of whole Python programs, effectively turning them into neural interpreters (FAIR CodeGen Team…

Machine Learning · Computer Science 2026-03-11 Maximilian Beck , Jonas Gehring , Jannik Kossen , Gabriel Synnaeve

Codellm-Devkit: A Framework for Contextualizing Code LLMs with Program Analysis Insights

Large Language Models for Code (or code LLMs) are increasingly gaining popularity and capabilities, offering a wide array of functionalities such as code completion, code generation, code summarization, test generation, code translation,…

Software Engineering · Computer Science 2024-10-18 Rahul Krishna , Rangeet Pan , Raju Pavuluri , Srikanth Tamilselvam , Maja Vukovic , Saurabh Sinha

Enhancing LLM-Based Bug Reproduction for Android Apps via Pre-Assessment of Visual Effects

In the development and maintenance of Android apps, the quick and accurate reproduction of user-reported bugs is crucial to ensure application quality and improve user satisfaction. However, this process is often time-consuming and complex.…

Software Engineering · Computer Science 2026-04-01 Xiangyang Xiao , Huaxun Huang , Rongxin Wu

RevMine: An LLM-Assisted Tool for Code Review Mining and Analysis Across Git Platforms

Empirical research on code review processes is increasingly central to understanding software quality and collaboration. However, collecting and analyzing review data remains a time-consuming and technically intensive task. Most researchers…

Software Engineering · Computer Science 2025-10-07 Samah Kansab , Francis Bordeleau , Ali Tizghadam

VeriLLMed: Interactive Visual Debugging of Medical Large Language Models with Knowledge Graphs

Large language models (LLMs) show promise in medical diagnosis, but real-world deployment remains challenging due to high-stakes clinical decisions and imperfect reasoning reliability. As a result, careful inspection of model behavior is…

Computation and Language · Computer Science 2026-04-28 Yurui Xiang , Xingyi Mao , Rui Sheng , Zixin Chen , Zelin Zang , Yuyang Wu , Haipeng Zeng , Huamin Qu , Yushi Sun , Yanna Lin

libRoadRunner: A High Performance SBML Simulation and Analysis Library

This paper presents libRoadRunner, an extensible, high-performance, cross-platform, open-source software library for the simulation and analysis of models expressed using Systems Biology Markup Language (SBML). SBML is the most widely…

Subcellular Processes · Quantitative Biology 2015-03-04 Endre T. Somogyi , Jean-Marie Bouteiller , James A. Glazier , Matthias König , Kyle Medley , Maciej H. Swat , Herbert M. Sauro