Related papers: CodeUpdateArena: Benchmarking Knowledge Editing on…

CodeArena: A Collective Evaluation Platform for LLM Code Generation

Large Language Models (LLMs) have reshaped code generation by synergizing their exceptional comprehension of natural language and programming syntax, thereby substantially boosting developer productivity. These advancements have prompted…

Software Engineering · Computer Science 2025-03-04 Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng

Evaluating and Aligning CodeLLMs on Human Preference

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common…

Computation and Language · Computer Science 2024-12-09 Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin

CODESYNC: Synchronizing Large Language Models with Dynamic Code Evolution at Scale

Large Language Models (LLMs) have exhibited exceptional performance in software engineering yet face challenges in adapting to continually evolving code knowledge, particularly regarding the frequent updates of third-party library APIs.…

Computation and Language · Computer Science 2025-06-19 Chenlong Wang, Zhaoyang Chu, Zhengxiang Cheng, Xuyi Yang, Kaiyue Qiu, Yao Wan, Zhou Zhao, Xuanhua Shi, Dongping Chen

ReCode: Updating Code API Knowledge with Reinforcement Learning

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their…

Computation and Language · Computer Science 2025-11-25 Haoze Wu, Yunzhi Yao, Wenhao Yu, Ningyu Zhang

Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation

Automating the decision of whether a code change requires manual review is vital for maintaining software quality in modern development workflows. However, the emergence of new programming languages and frameworks creates a critical…

Software Engineering · Computer Science 2025-09-08 Yogev Cohen, Dudi Ohayon, Romy Somkin, Yehudit Aperstein, Alexander Apartsin

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing…

Software Engineering · Computer Science 2025-04-09 Jiawei Guo, Ziming Li, Xueling Liu, Kaijing Ma, Tianyu Zheng, Zhouliang Yu, Ding Pan, Yizhi LI, Ruibo Liu, Yue Wang, Shuyue Guo, Xingwei Qu, Xiang Yue, Ge Zhang, Wenhu Chen, Jie Fu

Understanding Robustness of Model Editing in Code LLMs

Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining…

Software Engineering · Computer Science 2026-05-11 Vinaik Chhetri, Moghis Fereidouni, A. B Siddique, Umar Farooq

When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation

The rapid evolution of software libraries creates a significant challenge for Large Language Models (LLMs), whose static parametric knowledge often becomes stale post-training. While retrieval-augmented generation (RAG) is commonly used to…

Software Engineering · Computer Science 2026-04-13 Ahmed Nusayer Ashik, Shaowei Wang, Tse-Hsun Chen, Muhammad Asaduzzaman, Yuan Tian

Benchmarking and Rethinking Knowledge Editing for Large Language Models

Knowledge editing aims to update the embedded knowledge within Large Language Models (LLMs). However, existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation…

Computation and Language · Computer Science 2025-05-27 Guoxiu He, Xin Song, Futing Wang, Aixin Sun

Beyond Synthetic Benchmarks: Evaluating LLM Performance on Real-World Class-Level Code Generation

Large language models (LLMs) have demonstrated strong performance on function-level code generation benchmarks, yet real-world software development increasingly demands class-level implementations that integrate multiple methods,…

Software Engineering · Computer Science 2025-11-06 Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in…

Artificial Intelligence · Computer Science 2025-02-07 Léo Boisvert, Megh Thakkar, Maxime Gasse, Massimo Caccia, Thibault Le Sellier De Chezelles, Quentin Cappart, Nicolas Chapados, Alexandre Lacoste, Alexandre Drouin

A Survey of Code Review Benchmarks and Evaluation Practices in Pre-LLM and LLM Era

Code review is a critical practice in modern software engineering, helping developers detect defects early, improve code quality, and facilitate knowledge sharing. With the rapid advancement of large language models (LLMs), a growing body…

Software Engineering · Computer Science 2026-02-17 Taufiqul Islam Khan, Shaowei Wang, Haoxiang Zhang, Tse-Hsun Chen

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a…

Software Engineering · Computer Science 2025-06-23 Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li

CelloAI Benchmarks: Toward Repeatable Evaluation of AI Assistants

Large Language Models (LLMs) are increasingly used for software development, yet existing benchmarks for LLM-based coding assistance do not reflect the constraints of High Energy Physics (HEP) and High Performance Computing (HPC) software.…

High Energy Physics - Experiment · Physics 2026-03-03 Mohammad Atif, Kriti Chopra, Fang-Ying Tsai, Ozgur O. Kilic, Tianle Wang, Zhihua Dong, Douglas Benjamin, Charles Leggett, Meifeng Lin, Paolo Calafiura, Salman Habib

ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to…

Artificial Intelligence · Computer Science 2025-09-09 Hao Kang, Chenyan Xiong

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models

Large language models (LLMs) have achieved impressive performance across various natural language benchmarks, prompting a continual need to curate more difficult datasets for larger LLMs, which is costly and time-consuming. In this paper,…

Computation and Language · Computer Science 2024-06-07 Jiahao Ying, Yixin Cao, Yushi Bai, Qianru Sun, Bo Wang, Wei Tang, Zhaojun Ding, Yizhe Yang, Xuanjing Huang, Shuicheng Yan

Resolving Editing-Unlearning Conflicts: A Knowledge Codebook Framework for Large Language Model Updating

Large Language Models (LLMs) excel in natural language processing by encoding extensive human knowledge, but their utility relies on timely updates as knowledge evolves. Updating LLMs involves two key tasks simultaneously: unlearning to…

Computation and Language · Computer Science 2025-02-04 Binchi Zhang, Zhengzhang Chen, Zaiyi Zheng, Jundong Li, Haifeng Chen

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks range from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

With the emergence of Large Language Models (LLMs), there has been a significant improvement in the programming capabilities of models, attracting growing attention from researchers. Evaluating the programming capabilities of LLMs is…

Computation and Language · Computer Science 2024-03-12 Lingyue Fu, Huacan Chai, Shuang Luo, Kounianhua Du, Weiming Zhang, Longteng Fan, Jiayi Lei, Renting Rui, Jianghao Lin, Yuchen Fang, Yifan Liu, Jingkuan Wang, Siyuan Qi, Kangning Zhang, Weinan Zhang, Yong Yu

Byam: Fixing Breaking Dependency Updates with Large Language Models

Application Programming Interfaces (APIs) facilitate the integration of third-party dependencies within the code of client applications. However, changes to an API, such as deprecation, modification of parameter names or types, or complete…

Software Engineering · Computer Science 2026-04-14 Frank Reyes, May Mahmoud, Federico Bono, Sarah Nadi, Benoit Baudry, Martin Monperrus