Related papers: Benchmarking Causal Study to Interpret Large Langu…

Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach

While code generation has been widely used in various software development scenarios, the quality of the generated code is not guaranteed. This has been a particular concern in the era of large language model (LLM)-based code generation,…

Software Engineering · Computer Science 2023-10-11 Zhenlan Ji, Pingchuan Ma, Zongjie Li, Shuai Wang

CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models

Causal reasoning capabilities are essential for large language models (LLMs) in a wide range of applications, such as education and healthcare. But there is still a lack of benchmarks for a better understanding of such capabilities. Current…

Computation and Language · Computer Science 2024-12-25 Ruibo Tu, Hedvig Kjellström, Gustav Eje Henter, Cheng Zhang

The Fault in our Stars: Quality Assessment of Code Generation Benchmarks

Large Language Models (LLMs) are gaining popularity among software engineers. A crucial aspect of developing effective code generation LLMs is to evaluate these models using a robust benchmark. Evaluation benchmarks with quality issues can…

Software Engineering · Computer Science 2024-09-05 Mohammed Latif Siddiq, Simantika Dristi, Joy Saha, Joanna C. S. Santos

A Critical Review of Causal Reasoning Benchmarks for Large Language Models

Numerous benchmarks aim to evaluate the capabilities of Large Language Models (LLMs) for causal inference and reasoning. However, many of them can likely be solved through the retrieval of domain knowledge, questioning whether they achieve…

Machine Learning · Computer Science 2024-07-12 Linying Yang, Vik Shirvaikar, Oscar Clivio, Fabian Falck

Prompting or Fine-tuning? Exploring Large Language Models for Causal Graph Validation

This study explores the capability of Large Language Models (LLMs) to evaluate causality in causal graphs generated by conventional statistical causal discovery methods, a task traditionally reliant on manual assessment by human subject…

Computation and Language · Computer Science 2025-04-16 Yuni Susanti, Nina Holsmoelle

Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach

Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLMs' program synthesis capability has generally relied on manually curated…

Software Engineering · Computer Science 2025-05-13 Longtian Wang, Tianlin Li, Xiaofei Xie, Yuhan Zhi, Jian Wang, Chao Shen

A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

Large Language Models (LLMs) have been gaining increasing attention and demonstrated promising performance across a variety of Software Engineering (SE) tasks, such as Automated Program Repair (APR), code summarization, and code completion.…

Software Engineering · Computer Science 2024-04-18 Quanjun Zhang, Tongke Zhang, Juan Zhai, Chunrong Fang, Bowen Yu, Weisong Sun, Zhenyu Chen

Causal Reasoning and Large Language Models: Opening a New Frontier for Causality

The causal capabilities of large language models (LLMs) are a matter of significant debate, with critical implications for the use of LLMs in societally impactful domains such as medicine, science, law, and policy. We conduct a "behavioral"…

Artificial Intelligence · Computer Science 2024-08-21 Emre Kıcıman, Robert Ness, Amit Sharma, Chenhao Tan

Re-Evaluating Code LLM Benchmarks Under Semantic Mutation

In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…

Software Engineering · Computer Science 2025-06-24 Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang

Ice Cream Doesn't Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference

Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical…

Artificial Intelligence · Computer Science 2026-05-13 Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang, Charles Doss, Xiaotong Shen, Jie Ding

On the Evaluation of Large Language Models in Unit Test Generation

Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of Large Language Models (LLMs) offers…

Software Engineering · Computer Science 2024-09-26 Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, Junjie Chen

Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models

Generative Large Language Models (LLMs) are increasingly used in non-generative software maintenance tasks, such as fault localization (FL). Success in FL depends on a model's ability to reason about program semantics beyond surface-level…

Software Engineering · Computer Science 2026-03-06 Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar

Investigating the Efficacy of Large Language Models for Code Clone Detection

Large Language Models (LLMs) have demonstrated remarkable success in various natural language processing and software engineering tasks, such as code generation. The LLMs are mainly utilized in the prompt-based zero/few-shot paradigm to…

Software Engineering · Computer Science 2024-01-31 Mohamad Khajezade, Jie JW Wu, Fatemeh Hendijani Fard, Gema Rodríguez-Pérez, Mohamed Sami Shehata

Prompt engineering does not universally improve Large Language Model performance across clinical decision-making tasks

Large Language Models (LLMs) have demonstrated promise in medical knowledge assessments, yet their practical utility in real-world clinical decision-making remains underexplored. In this study, we evaluated the performance of three…

Computation and Language · Computer Science 2025-12-30 Mengdi Chai, Ali R. Zomorrodi

Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure…

Software Engineering · Computer Science 2023-11-01 Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang

Reporting LLM Prompting in Automated Software Engineering: A Guideline Based on Current Practices and Expectations

Large Language Models, particularly decoder-only generative models such as GPT, are increasingly used to automate Software Engineering tasks. These models are primarily guided through natural language prompts, making prompt engineering a…

Software Engineering · Computer Science 2026-01-06 Alexander Korn, Lea Zaruchas, Chetan Arora, Andreas Metzger, Sven Smolka, Fanyu Wang, Andreas Vogelsang

Assisting Research Proposal Writing with Large Language Models: Evaluation and Refinement

Large language models (LLMs) like ChatGPT are increasingly used in academic writing, yet issues such as incorrect or fabricated references raise ethical concerns. Moreover, current content quality evaluations often rely on subjective human…

Computation and Language · Computer Science 2025-09-15 Jing Ren, Weiqi Wang

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks. This issue is especially critical for…

Computation and Language · Computer Science 2024-04-05 Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Evaluating Large Language Models for Code Translation: Effects of Prompt Language and Prompt Design

Large language models (LLMs) have shown promise for automated source-code translation, a capability critical to software migration, maintenance, and interoperability. Yet comparative evidence on how model choice, prompt design, and prompt…

Software Engineering · Computer Science 2025-09-17 Aamer Aljagthami, Mohammed Banabila, Musab Alshehri, Mohammed Kabini, Mohammad D. Alahmadi

Reliability of Large Language Models for Design Synthesis: An Empirical Study of Variance, Prompt Sensitivity, and Method Scaffolding

Large Language Models (LLMs) are increasingly applied to automate software engineering tasks, including the generation of UML class diagrams from natural language descriptions. While prior work demonstrates that LLMs can produce…

Software Engineering · Computer Science 2026-04-07 Rabia Iftikhar, Andreas Rausch