Related papers: JMigBench: A Benchmark for Evaluating LLMs on Sour…

MigrationBench: Repository-Level Code Migration Benchmark from Java 8

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark…

Software Engineering · Computer Science 2026-05-29 Linbo Liu , Xinle Liu , Qiang Zhou , Lin Chen , Yihan Liu , Hoan Nguyen , Behrooz Omidvar-Tehrani , Xi Shen , Jun Huan , Omer Tripp , Anoop Deoras

CODEMENV: Benchmarking Large Language Models on Code Migration

Large language models (LLMs) have shown remarkable capabilities across various software engineering tasks; however, their effectiveness in code migration, adapting code to run in different environments, remains insufficiently studied. In…

Software Engineering · Computer Science 2025-06-03 Keyuan Cheng , Xudong Shen , Yihao Yang , Tengyue Wang , Yang Cao , Muhammad Asif Ali , Hanbin Wang , Lijie Hu , Di Wang

Merge-Bench: Resolve Merge Conflicts with Large Language Models

This paper applies machine learning to the difficult and important task of version control merging. (1) We constructed a dataset, Merge-Bench, of 7938 real-world merge conflict hunks from 1439 GitHub repositories. The ground truth is the…

Machine Learning · Computer Science 2026-05-26 Benedikt Schesch , Michael D. Ernst

FreshBrew: A Benchmark for Evaluating AI Agents on Java Code Migration

AI coding assistants are rapidly becoming integral to modern software development. A key challenge in this space is the continual need to migrate and modernize codebases in response to evolving software ecosystems. Traditionally, such…

Software Engineering · Computer Science 2025-10-14 Victor May , Diganta Misra , Yanqi Luo , Anjali Sridhar , Justine Gehring , Silvio Soares Ribeiro Junior

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science 2024-09-17 Jia Feng , Jiachen Liu , Cuiyun Gao , Chun Yong Chong , Chaozheng Wang , Shan Gao , Xin Xia

CodeEditorBench: Evaluating Code Editing Capability of Large Language Models

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing…

Software Engineering · Computer Science 2025-04-09 Jiawei Guo , Ziming Li , Xueling Liu , Kaijing Ma , Tianyu Zheng , Zhouliang Yu , Ding Pan , Yizhi LI , Ruibo Liu , Yue Wang , Shuyue Guo , Xingwei Qu , Xiang Yue , Ge Zhang , Wenhu Chen , Jie Fu

TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks

With the advancement of automated software engineering, research focus is increasingly shifting toward practical tasks reflecting the day-to-day work of software engineers. Among these tasks, software migration, a critical process of…

Software Engineering · Computer Science 2026-04-29 Ryo Fujii , Makoto Morishita , Kazuki Yano , Jun Suzuki

An Empirical Study of Python Library Migration Using Large Language Models

Library migration is the process of replacing one library with another library that provides similar functionality. Manual library migration is time consuming and error prone, as it requires developers to understand the APIs of both…

Software Engineering · Computer Science 2025-10-14 Md Mohayeminul Islam , Ajay Kumar Jha , May Mahmoud , Ildar Akhmetov , Sarah Nadi

Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study

Large language models (LLMs) have achieved state-of-the-art performance in various software engineering tasks, including error detection, clone detection, and code translation, primarily leveraging high-resource programming languages like…

Computation and Language · Computer Science 2025-06-11 Razan Baltaji , Saurabh Pujar , Louis Mandel , Martin Hirzel , Luca Buratti , Lav Varshney

Secret Breach Detection in Source Code with Large Language Models

Background: Leaking sensitive information - such as API keys, tokens, and credentials - in source code remains a persistent security threat. Traditional regex and entropy-based tools often generate high false positives due to limited…

Software Engineering · Computer Science 2025-07-29 Md Nafiu Rahman , Sadif Ahmed , Zahin Wahab , S M Sohan , Rifat Shahriyar

Assertion Messages with Large Language Models (LLMs) for Code

Assertion messages significantly enhance unit tests by clearly explaining the reasons behind test failures, yet they are frequently omitted by developers and automated test-generation tools. Despite recent advancements, Large Language…

Software Engineering · Computer Science 2025-09-25 Ahmed Aljohani , Anamul Haque Mollah , Hyunsook Do

CodeMixBench: Evaluating Code-Mixing Capabilities of LLMs Across 18 Languages

Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks are limited by their narrow language pairs and tasks, failing to adequately assess large…

Computation and Language · Computer Science 2025-09-09 Yilun Yang , Yekun Chai

Can Large Language Models Replace Human Coders? Introducing ContentBench

Can low-cost large language models (LLMs) take over the interpretive coding work that still anchors much of empirical content analysis? This paper introduces ContentBench, a public benchmark suite that helps answer this replacement question…

Computers and Society · Computer Science 2026-02-24 Michael Haman

Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed their struggle for code translation. This is because they haven't been extensively pre-trained with parallel multilingual…

Software Engineering · Computer Science 2024-10-15 Qingxiao Tao , Tingrui Yu , Xiaodong Gu , Beijun Shen

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code…

Software Engineering · Computer Science 2024-09-16 Yuwei Zhao , Ziyang Luo , Yuchen Tian , Hongzhan Lin , Weixiang Yan , Annan Li , Jing Ma

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

In the evolving landscape of large language models (LLMs) tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail…

Software Engineering · Computer Science 2024-03-29 Zhengran Zeng , Yidong Wang , Rui Xie , Wei Ye , Shikun Zhang

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang , Ye Shang , Chunrong Fang , Siqi Gu , Jianyi Zhou , Zhenyu Chen

Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation

Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance, yet systematic benchmarks for evaluating large language models on this problem remain scarce. In this work, we present…

Software Engineering · Computer Science 2025-12-02 Parisa Hamedi , Hamed Jelodar , Samita Bai , Mohammad Meymani , Roozbeh Razavi-Far , Ali A. Ghorbani

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation

Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a…

Software Engineering · Computer Science 2025-06-23 Wei Li , Xin Zhang , Zhongxin Guo , Shaoguang Mao , Wen Luo , Guangyue Peng , Yangyu Huang , Houfeng Wang , Scarlett Li