Related papers: MATCH: Task-Driven Code Evaluation through Contras…

MATCHA: Matching Text via Contrastive Semantic Alignment

Reliable evaluation is essential for understanding large language model (LLM) performance, yet today's go-to metrics, namely token-overlap scores (e.g., ROUGE) and embedding-based measures (e.g., BERTScore), often misjudge semantic…

Computation and Language · Computer Science 2026-05-27 Siran Li, Ece Sena Etoglu, Carsten Eickhoff, Seyed Ali Bahrainian

Deep Assessment of Code Review Generation Approaches: Beyond Lexical Similarity

Code review is a standard practice for ensuring the quality of software projects, and recent research has focused extensively on automated code review. While significant advancements have been made in generating code reviews, the automated…

Software Engineering · Computer Science 2025-01-10 Yanjie Jiang, Hui Liu, Tianyi Chen, Fu Fan, Chunhao Dong, Kui Liu, Lu Zhang

ICE-Score: Instructing Large Language Models to Evaluate Code

Recent advancements in the field of natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine…

Artificial Intelligence · Computer Science 2024-01-23 Terry Yue Zhuo

ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation

Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of…

Computation and Language · Computer Science 2025-11-25 Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin

Aligning Offline Metrics and Human Judgments of Value for Code Generation Models

Large language models have demonstrated great potential to assist programmers in generating code. For such human-AI pair programming scenarios, we empirically demonstrate that while generated code is most often evaluated in terms of their…

Software Engineering · Computer Science 2023-06-14 Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, Saleema Amershi

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

Evaluation metrics play a vital role in the growth of an area as it defines the standard of distinguishing between good and bad models. In the area of code synthesis, the commonly used evaluation metric is BLEU or perfect accuracy, but they…

Software Engineering · Computer Science 2020-09-29 Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, Shuai Ma

CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells

The task of automated code review has recently gained a lot of attention from the machine learning community. However, current review comment evaluation metrics rely on comparisons with a human-written reference for a given code change…

Software Engineering · Computer Science 2025-03-18 Atharva Naik, Marcus Alenius, Daniel Fried, Carolyn Rose

CodeScore: Evaluating Code Generation by Learning Code Execution

A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from…

Software Engineering · Computer Science 2024-09-06 Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, Zhi Jin

Boosting Commit Classification with Contrastive Learning

Commit Classification (CC) is an important task in software maintenance, which helps software developers classify code changes into different types according to their nature and purpose. It allows developers to understand better how their…

Software Engineering · Computer Science 2023-08-17 Jiajun Tong, Zhixiao Wang, Xiaobin Rui

Who Evaluates the Evaluators? On Automatic Metrics for Assessing AI-based Offensive Code Generators

AI-based code generators are an emerging solution for automatically writing programs starting from descriptions in natural language, by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been…

Software Engineering · Computer Science 2023-04-14 Pietro Liguori, Cristina Improta, Roberto Natella, Bojan Cukic, Domenico Cotroneo

Bridging LLM-Generated Code and Requirements: Reverse Generation technique and SBC Metric for Developer Insights

The rise of Large Language Models (LLMs) in software engineering, particularly in code generation, has garnered significant attention. However, assessing the quality of AI-generated code remains a challenge due to the inherent complexity of…

Software Engineering · Computer Science 2025-02-13 Ahilan Ayyachamy Nadar Ponnusamy

CLARC: C/C++ Benchmark for Robust Code Search

Efficient code retrieval is critical for developer productivity, yet existing benchmarks largely focus on Python and rarely stress-test robustness beyond superficial lexical cues. To address the gap, we introduce an automated pipeline for…

Software Engineering · Computer Science 2026-03-06 Kaicheng Wang, Liyan Huang, Weike Fang, Weihang Wang

CERT: Continual Pre-Training on Sketches for Library-Oriented Code Generation

Code generation is a longstanding challenge, aiming to generate a code snippet based on a natural language description. Usually, expensive text-code paired data is essential for training a code generation model. Recently, thanks to the…

Software Engineering · Computer Science 2022-06-15 Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, Jian-Guang Lou

ChatGPT vs. DeepSeek: A Comparative Study on AI-Based Code Generation

Background: AI-powered code generation, fueled by Large Language Models (LLMs), is revolutionizing software development. Models like OpenAI's Codex and GPT-4, alongside DeepSeek, leverage vast code and natural language datasets. However,…

Software Engineering · Computer Science 2025-02-27 Md Motaleb Hossen Manik

On Assessing the Relevance of Code Reviews Authored by Generative Models

The use of large language models like ChatGPT in code review offers promising efficiency gains but also raises concerns about correctness and safety. Existing evaluation methods for code review generation either rely on automatic…

Software Engineering · Computer Science 2025-12-18 Robert Heumüller, Frank Ortmeier

CoMatch: Semi-supervised Learning with Contrastive Graph Regularization

Semi-supervised learning has been an effective paradigm for leveraging unlabeled data to reduce the reliance on labeled data. We propose CoMatch, a new semi-supervised learning method that unifies dominant approaches and addresses their…

Machine Learning · Computer Science 2021-03-04 Junnan Li, Caiming Xiong, Steven Hoi

Automating the Correctness Assessment of AI-generated Code for Security Contexts

Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses…

Software Engineering · Computer Science 2024-06-11 Domenico Cotroneo, Alessio Foggia, Cristina Improta, Pietro Liguori, Roberto Natella

CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming

Competitive programming benchmarks are widely used in scenarios such as programming contests and large language model assessments. However, the growing presence of duplicate or highly similar problems raises concerns not only about…

Software Engineering · Computer Science 2025-10-28 Han Deng, Yuan Meng, Shixiang Tang, Wanli Ouyang, Xinzhu Ma

Cross-Examination Framework: A Task-Agnostic Diagnostic for Information Fidelity in Text-to-Text Generation

Traditional metrics like BLEU and BERTScore fail to capture semantic fidelity in generative text-to-text tasks. We adapt the Cross-Examination Framework (CEF) for a reference-free, multi-dimensional evaluation by treating the source and…

Computation and Language · Computer Science 2026-01-28 Tathagata Raha, Clement Christophe, Nada Saadi, Hamza A Javed, Marco AF Pimentel, Ronnie Rajan, Praveenkumar Kanithi

Adapting Standard Retrieval Benchmarks to Evaluate Generated Answers

Large language models can now directly generate answers to many factual questions without referencing external sources. Unfortunately, relatively little attention has been paid to methods for evaluating the quality and correctness of these…

Information Retrieval · Computer Science 2024-01-11 Negar Arabzadeh, Amin Bigdeli, Charles L. A. Clarke