Related papers: GitChameleon 2.0: Evaluating AI Code Generation Ag…

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

The rapid evolution of software libraries presents a significant challenge for code generation models, which must adapt to frequent version updates while maintaining compatibility with previous versions. Existing code completion benchmarks…

Software Engineering · Computer Science 2024-11-12 Nizar Islah , Justine Gehring , Diganta Misra , Eilif Muller , Irina Rish , Terry Yue Zhuo , Massimo Caccia

GPT-4.1 Sets the Standard in Automated Experiment Design Using Novel Python Libraries

Large Language Models (LLMs) have advanced rapidly as tools for automating code generation in scientific research, yet their ability to interpret and use unfamiliar Python APIs for complex computational experiments remains poorly…

Software Engineering · Computer Science 2025-09-17 Nuno Fachada , Daniel Fernandes , Carlos M. Fernandes , Bruno D. Ferreira-Saraiva , João P. Matos-Carvalho

CIFE: Code Instruction-Following Evaluation

Large Language Models (LLMs) are increasingly applied to real-world code generation, where functional correctness alone is insufficient for reliable deployment, developers also expect adherence to explicit requirements for robustness,…

Software Engineering · Computer Science 2025-12-22 Sravani Gunnu , Shanmukha Guttula , Hima Patel

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

The rapid evolution of software libraries creates a significant challenge for Large Language Models (LLMs), whose static parametric knowledge often becomes stale post-training. While retrieval-augmented generation (RAG) is commonly used to…

Software Engineering · Computer Science 2026-04-13 Ahmed Nusayer Ashik , Shaowei Wang , Tse-Hsun Chen , Muhammad Asaduzzaman , Yuan Tian

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng , Jiachen Liu , Cuiyun Gao , Chun Yong Chong , Chaozheng Wang , Shan Gao , Xin Xia

A Comprehensive Framework for Evaluating API-oriented Code Generation in Large Language Models

Large language models (LLMs) like GitHub Copilot and ChatGPT have emerged as powerful tools for code generation, significantly enhancing productivity and accelerating software development. However, existing benchmarks primarily focus on…

Software Engineering · Computer Science 2024-09-27 Yixi Wu , Pengfei He , Zehao Wang , Shaowei Wang , Yuan Tian , Tse-Hsun Chen

CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs.…

Computation and Language · Computer Science 2025-06-19 Chenlong Wang , Zhaoyang Chu , Zhengxiang Cheng , Xuyi Yang , Kaiyue Qiu , Yao Wan , Zhou Zhao , Xuanhua Shi , Dongping Chen

LibEvolutionEval: A Benchmark and Study for Version-Specific Code Generation

Recent advancements in code completion models have primarily focused on local file contexts. However, these studies do not fully capture the complexity of real-world software development, which often requires the use of rapidly-evolving…

Software Engineering · Computer Science 2025-05-07 Sachit Kuhar , Wasi Uddin Ahmad , Zijian Wang , Nihal Jain , Haifeng Qian , Baishakhi Ray , Murali Krishna Ramanathan , Xiaofei Ma , Anoop Deoras

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a…

Software Engineering · Computer Science 2025-06-23 Wei Li , Xin Zhang , Zhongxin Guo , Shaoguang Mao , Wen Luo , Guangyue Peng , Yangyu Huang , Houfeng Wang , Scarlett Li

Holistic Evaluation of State-of-the-Art LLMs for Code Generation

This study presents a comprehensive empirical evaluation of six state-of-the-art large language models (LLMs) for code generation, including both general-purpose and code-specialized models. Using a dataset of 944 real-world LeetCode…

Software Engineering · Computer Science 2025-12-23 Le Zhang , Suresh Kothari

Energy-Aware Code Generation with LLMs: Benchmarking Small vs. Large Language Models for Sustainable AI Programming

Large Language Models (LLMs) are widely used for code generation. However, commercial models like ChatGPT require significant computing power, which leads to high energy use and carbon emissions. This has raised concerns about their…

Software Engineering · Computer Science 2025-08-13 Humza Ashraf , Syed Muhammad Danish , Aris Leivadeas , Yazan Otoum , Zeeshan Sattar

CodeMirage: A Multi-Lingual Benchmark for Detecting AI-Generated and Paraphrased Source Code from Production-Level LLMs

Large language models (LLMs) have become integral to modern software development, producing vast amounts of AI-generated source code. While these models boost programming productivity, their misuse introduces critical risks, including code…

Software Engineering · Computer Science 2025-06-16 Hanxi Guo , Siyuan Cheng , Kaiyuan Zhang , Guangyu Shen , Xiangyu Zhang

VersiCode: Towards Version-controllable Code Generation

Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development, marked by frequent library updates. This gap significantly limits LLMs'…

Software Engineering · Computer Science 2024-10-17 Tongtong Wu , Weigang Wu , Xingyu Wang , Kang Xu , Suyu Ma , Bo Jiang , Ping Yang , Zhenchang Xing , Yuan-Fang Li , Gholamreza Haffari

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure…

Software Engineering · Computer Science 2023-11-01 Jiawei Liu , Chunqiu Steven Xia , Yuyao Wang , Lingming Zhang

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks

To adequately test modern code generation systems, evaluation benchmarks must execute and test the code generated by the system. However, these execution and testing requirements have largely limited benchmarks to settings where code is…

Software Engineering · Computer Science 2024-10-04 Yiqing Xie , Alex Xie , Divyanshu Sheth , Pengfei Liu , Daniel Fried , Carolyn Rose

Evaluating Large Language Models for Code Review

Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not…

Software Engineering · Computer Science 2025-05-27 Umut Cihan , Arda İçöz , Vahid Haratian , Eray Tüzün

AI-Generated Code Is Not Reproducible (Yet): An Empirical Study of Dependency Gaps in LLM-Based Coding Agents

The rise of Large Language Models (LLMs) as coding agents promises to accelerate software development, but their impact on generated code reproducibility remains largely unexplored. This paper presents an empirical study investigating…

Software Engineering · Computer Science 2026-03-25 Bhanu Prakash Vangala , Ali Adibifar , Ashish Gehani , Tanu Malik

Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation

In recent years, large language models (LLMs) have emerged as powerful tools with potential applications in various fields, including software engineering. Within the scope of this research, we evaluate five different state-of-the-art LLMs…

Computation and Language · Computer Science 2024-09-09 Luis Mayer , Christian Heumann , Matthias Aßenmacher

Comparing Human and LLM Generated Code: The Jury is Still Out!

Much is promised in relation to AI-supported software development. However, there has been limited evaluation effort in the research domain aimed at validating the true utility of such techniques, especially when compared to human coding…

Software Engineering · Computer Science 2025-01-29 Sherlock A. Licorish , Ansh Bajpai , Chetan Arora , Fanyu Wang , Kla Tantithamthavorn