Related papers: CodeJudge: Evaluating Code Generation with Large L…

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by…

Computation and Language · Computer Science 2025-08-15 Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code…

Software Engineering · Computer Science 2024-09-16 Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma

A Survey on Evaluating Large Language Models in Code Generation Tasks

This paper provides a comprehensive review of the current methods and metrics used to evaluate the performance of Large Language Models (LLMs) in code generation tasks. With the rapid growth in demand for automated software development,…

Software Engineering · Computer Science 2025-03-05 Liguo Chen, Qi Guo, Hongrui Jia, Zhengran Zeng, Xin Wang, Yijiang Xu, Jian Wu, Yidong Wang, Qing Gao, Jindong Wang, Wei Ye, Shikun Zhang

Examination of Code generated by Large Language Models

Large language models (LLMs), such as ChatGPT and Copilot, are transforming software development by automating code generation and, arguably, enable rapid prototyping, support education, and boost productivity. Therefore, correctness and…

Software Engineering · Computer Science 2024-08-30 Robin Beer, Alexander Feix, Tim Guttzeit, Tamara Muras, Vincent Müller, Maurice Rauscher, Florian Schäffler, Welf Löwe

Automated Code Review Using Large Language Models with Symbolic Reasoning

Code review is one of the key processes in the software development lifecycle and is essential to maintain code quality. However, manual code review is subjective and time consuming. Given its rule-based nature, code review is well suited…

Software Engineering · Computer Science 2025-07-25 Busra Icoz, Goksel Biricik

Large Language Models for Code Generation: A Comprehensive Survey of Challenges, Techniques, Evaluation, and Applications

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh, Beiyu Lin

Improving Code Generation via Small Language Model-as-a-judge

Large language models (LLMs) have shown remarkable capabilities in automated code generation. While effective for mainstream languages, they may underperform on less common or domain-specific languages, prompting companies to develop…

Software Engineering · Computer Science 2026-02-13 Giuseppe Crupi, Rosalia Tufano, Gabriele Bavota

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs

Recently, there has been a growing trend of employing large language models (LLMs) to judge the quality of other LLMs. Many studies have adopted closed-source models, mainly using GPT-4 as the evaluator. However, due to the closed-source…

Artificial Intelligence · Computer Science 2024-10-15 Yijie Li, Yuan Sun

CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement

Large Language Models (LLMs) have revolutionized code generation but require significant resources and often over-generalize, limiting their task-specific efficiency. Fine-tuning smaller, open-source LLMs provides a cost-effective…

Computation and Language · Computer Science 2025-06-27 Leitian Tao, Xiang Chen, Tong Yu, Tung Mai, Ryan Rossi, Yixuan Li, Saayan Mitra

CodeMind: Evaluating Large Language Models for Code Reasoning

Large Language Models (LLMs) have been widely used to automate programming tasks. Their capabilities have been evaluated by assessing the quality of generated code through tests or proofs. The extent to which they can reason about code is a…

Software Engineering · Computer Science 2026-04-08 Changshu Liu, Yang Chen, Reyhaneh Jabbarvand

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization

Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated…

Software Engineering · Computer Science 2025-07-23 Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, Gabriele Bavota

Guided Code Generation with LLMs: A Multi-Agent Framework for Complex Code Tasks

Large Language Models (LLMs) have shown remarkable capabilities in code generation tasks, yet they face significant limitations in handling complex, long-context programming challenges and demonstrating complex compositional reasoning…

Artificial Intelligence · Computer Science 2025-01-14 Amr Almorsi, Mohanned Ahmed, Walid Gomaa

On Evaluating the Efficiency of Source Code Generated by LLMs

Recent years have seen the remarkable capabilities of large language models (LLMs) for code generation. Different from existing work that evaluates the correctness of the code generated by LLMs, we propose to further evaluate its efficiency.…

Software Engineering · Computer Science 2024-04-10 Changan Niu, Ting Zhang, Chuanyi Li, Bin Luo, Vincent Ng

A Survey on Large Language Models for Code Generation

Large Language Models (LLMs) have garnered remarkable advancements across diverse code-related tasks, known as Code LLMs, particularly in code generation that generates source code with LLM from natural language descriptions. This…

Computation and Language · Computer Science 2025-10-28 Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, Sunghun Kim

Measuring Determinism in Large Language Models for Software Code Review

Large Language Models (LLMs) promise to streamline software code reviews, but their ability to produce consistent assessments remains an open question. In this study, we tested four leading LLMs -- GPT-4o mini, GPT-4o, Claude 3.5 Sonnet,…

Software Engineering · Computer Science 2025-03-03 Eugene Klishevich, Yegor Denisov-Blanch, Simon Obstbaum, Igor Ciobanu, Michal Kosinski

Code Evolution Graphs: Understanding Large Language Model Driven Design of Algorithms

Large Language Models (LLMs) have demonstrated great promise in generating code, especially when used inside an evolutionary computation framework to iteratively optimize the generated algorithms. However, in some cases they fail to…

Neural and Evolutionary Computing · Computer Science 2025-03-24 Niki van Stein, Anna V. Kononova, Lars Kotthoff, Thomas Bäck

CodeSift: An LLM-Based Reference-Less Framework for Automatic Code Validation

The advent of large language models (LLMs) has greatly facilitated code generation, but ensuring the functional correctness of generated code remains a challenge. Traditional validation methods are often time-consuming, error-prone, and…

Software Engineering · Computer Science 2024-08-29 Pooja Aggarwal, Oishik Chatterjee, Ting Dai, Prateeti Mohapatra, Brent Paulovicks, Brad Blancett, Arthur De Magalhaes

Rethinking Code Refinement: Learning to Judge Code Efficiency

Large Language Models (LLMs) have demonstrated impressive capabilities in understanding and generating codes. Due to these capabilities, many recent methods are proposed to automatically refine the codes with LLMs. However, we should…

Software Engineering · Computer Science 2024-10-31 Minju Seo, Jinheon Baek, Sung Ju Hwang

CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models

Pre-trained on massive amounts of code and text data, large language models (LLMs) have demonstrated remarkable achievements in performing code generation tasks. With additional execution-based feedback, these models can act as agents with…

Computation and Language · Computer Science 2024-11-14 Jierui Li, Hung Le, Yingbo Zhou, Caiming Xiong, Silvio Savarese, Doyen Sahoo

CodeGrad: Integrating Multi-Step Verification with Gradient-Based LLM Refinement

While Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, they often produce solutions that lack guarantees of correctness, robustness, and efficiency. This limitation is particularly acute in domains…

Software Engineering · Computer Science 2025-09-04 Yueke Zhang, Yifan Zhang, Kevin Leach, Yu Huang