Related papers: Gistify! Codebase-Level Understanding via Runtime …

CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program…

Software Engineering · Computer Science 2026-03-05 Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev

Understanding Codebase like a Professional! Human-AI Collaboration for Code Comprehension

Understanding an unfamiliar codebase is an essential task for developers in various scenarios, such as during the onboarding process. Especially when the codebase is large and time is limited, achieving a decent level of comprehension…

Human-Computer Interaction · Computer Science 2026-02-16 Jie Gao, Yue Xue, Xiaofei Xie, SoeMin Thant, Erika Lee, Bowen Xu

SelfPiCo: Self-Guided Partial Code Execution with LLMs

Code executability plays a vital role in software debugging and testing (e.g., detecting runtime exceptions or assertion violations). However, code execution, especially partial or arbitrary code execution, is a non-trivial task due to…

Software Engineering · Computer Science 2024-07-25 Zhipeng Xue, Zhipeng Gao, Shaohua Wang, Xing Hu, Xin Xia, Shanping Li

CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation

With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated…

Software Engineering · Computer Science 2025-08-05 Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Shaosheng Cao, Donglin Di, Zhoujun Li

Gistable: Evaluating the Executability of Python Code Snippets on GitHub

Software developers create and share code online to demonstrate programming language concepts and programming tasks. Code snippets can be a useful way to explain and demonstrate a programming concept, but may not always be directly…

Software Engineering · Computer Science 2018-08-16 Eric Horton, Chris Parnin

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks…

Software Engineering · Computer Science 2024-11-12 Nizar Islah, Justine Gehring, Diganta Misra, Eilif Muller, Irina Rish, Terry Yue Zhuo, Massimo Caccia

CodePlan: Repository-level Coding using LLMs and Planning

Software engineering activities such as package migration, fixing error reports from static analysis or testing, and adding type annotations or other specifications to a codebase, involve pervasively editing the entire repository of code.…

Software Engineering · Computer Science 2023-09-25 Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Vageesh D C, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B. Ashok, Shashank Shet

DOCE: Finding the Sweet Spot for Execution-Based Code Generation

Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by…

Computation and Language · Computer Science 2024-10-17 Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins

CodeRefine: A Pipeline for Enhancing LLM-Generated Code Implementations of Research Papers

This paper presents CodeRefine, a novel framework for automatically transforming research paper methodologies into functional code using Large Language Models (LLMs). Our multi-step approach first extracts and summarizes key text chunks…

Computation and Language · Computer Science 2026-03-27 Ekaterina Trofimova, Emil Sataev, Abhijit Singh Jowhari

CodeSift: An LLM-Based Reference-Less Framework for Automatic Code Validation

The advent of large language models (LLMs) has greatly facilitated code generation, but ensuring the functional correctness of generated code remains a challenge. Traditional validation methods are often time-consuming, error-prone, and…

Software Engineering · Computer Science 2024-08-29 Pooja Aggarwal, Oishik Chatterjee, Ting Dai, Prateeti Mohapatra, Brent Paulovicks, Brad Blancett, Arthur De Magalhaes

Uncovering Code Insights: Leveraging GitHub Artifacts for Deeper Code Understanding

Understanding the purpose of source code is a critical task in software maintenance, onboarding, and modernization. While large language models (LLMs) have shown promise in generating code explanations, they often lack grounding in the…

Software Engineering · Computer Science 2025-11-06 Ziv Nevo, Orna Raz, Karen Yorav

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet, there is a dearth of studies on the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls,…

Software Engineering · Computer Science 2026-04-27 Changshu Liu, Alireza Ghazanfari, Yang Chen, Reyhaneh Jabbarvand

CIFE: Code Instruction-Following Evaluation

Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment; developers also expect adherence to explicit requirements for robustness,…

Software Engineering · Computer Science 2025-12-22 Sravani Gunnu, Shanmukha Guttula, Hima Patel

Personalized Code Readability Assessment: Are We There Yet?

Unreadable code could be a breeding ground for errors. Thus, previous work defined approaches based on machine learning to automatically assess code readability that can warn developers when some code artifacts (e.g., classes) become…

Software Engineering · Computer Science 2025-03-12 Antonio Vitale, Emanuela Guglielmi, Rocco Oliveto, Simone Scalabrino

Grounding Data Science Code Generation with Input-Output Specifications

Large language models (LLMs) have recently demonstrated a remarkable ability to generate code from natural language (NL) prompts. However, in the real world, NL is often too ambiguous to capture the true intent behind programming problems,…

Machine Learning · Computer Science 2024-03-18 Yeming Wen, Pengcheng Yin, Kensen Shi, Henryk Michalewski, Swarat Chaudhuri, Alex Polozov

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots

In the field of robotics, researchers face a critical challenge in ensuring reliable and efficient task planning. Verifying high-level task plans before execution significantly reduces errors and enhances the overall performance of these…

Robotics · Computer Science 2025-07-08 Danil S. Grigorev, Alexey K. Kovalev, Aleksandr I. Panov

LLM Code Customization with Visual Results: A Benchmark on TikZ

With the rise of AI-based code generation, customizing existing code from natural language instructions to modify visual results, such as figures or images, has become possible, promising to reduce the need for deep programming expertise.…

Software Engineering · Computer Science 2025-06-05 Charly Reux, Mathieu Acher, Djamel Eddine Khelladi, Olivier Barais, Clément Quinton

Code Execution as Grounded Supervision for LLM Reasoning

Training large language models (LLMs) with chain-of-thought (CoT) supervision has proven effective for enhancing their reasoning abilities. However, obtaining reliable and accurate reasoning supervision remains a significant challenge. We…

Computation and Language · Computer Science 2025-10-21 Dongwon Jung, Wenxuan Zhou, Muhao Chen

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra