Related papers: Correctness isn't Efficiency: Runtime Memory Diverg…

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by generating multiple candidate programs and measuring their disagreement. However, existing…

Software Engineering · Computer Science 2026-05-12 Weilin He, Arindam Sharma, Cristina David

Dynamic Stability of LLM-Generated Code

Current evaluations of LLMs for code generation emphasize functional correctness, overlooking the fact that functionally correct solutions can differ significantly in algorithmic complexity. For instance, an $O(n^2)$ versus $O(n \log…

Programming Languages · Computer Science 2025-11-12 Prateek Rajput, Abdoul Aziz Bonkoungou, Yewei Song, Abdoul Kader Kabore, Iyiola E. Olatunji, Jacques Klein, Tegewende Bissyande

Analyzing the Instability of Large Language Models in Automated Bug Injection and Correction

The use of Large Language Models (LLMs) in software engineering tasks is growing, especially in the areas of bug fixing and code generation. Nevertheless, these models often yield unstable results; when executed at different times with the…

Software Engineering · Computer Science 2025-09-09 Mehmet Bilal Er, Nagehan İlhan, Umut Kuran

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering

Large Language Models (LLMs) changed the way we design and interact with software systems. Their ability to process and extract information from text has drastically improved productivity in a number of routine tasks. Developers that want…

Machine Learning · Computer Science 2025-08-26 Federico Errica, Giuseppe Siracusano, Davide Sanvito, Roberto Bifulco

Measuring Reliability of Large Language Models through Semantic Consistency

While large pretrained language models (PLMs) demonstrate incredible fluency and performance on many natural language tasks, recent work has shown that well-performing PLMs are very sensitive to what prompts are fed into them. Even when…

Computation and Language · Computer Science 2023-04-13 Harsh Raj, Domenic Rosati, Subhabrata Majumdar

When Stability Fails: Hidden Failure Modes Of LLMS in Data-Constrained Scientific Decision-Making

Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility…

Machine Learning · Computer Science 2026-03-18 Nazia Riasat

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review

Large Language Models (LLMs) are widely used in software engineering to generate, complete, translate, and fix code, improving developer productivity. While most research focuses on the energy consumption and carbon emissions of model…

Software Engineering · Computer Science 2026-04-15 Sabiya Banu Masthan Ali, Oussema Kirmani, Aroosa Hameed, Syed Muhammad Danish, Gautam Srivastava

Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

Large language models (LLMs) are increasingly deployed under diverse numerical precision configurations, including standard floating-point formats (e.g., bfloat16 and float16) and quantized integer formats (e.g., int16 and int8), to meet…

Artificial Intelligence · Computer Science 2026-04-23 Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng, Li Pan

When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions

Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity,…

Software Engineering · Computer Science 2025-07-29 Maya Larbi, Amal Akli, Mike Papadakis, Rihab Bouyousfi, Maxime Cordy, Federica Sarro, Yves Le Traon

To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5%…

Software Engineering · Computer Science 2025-01-09 Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T. Barr, Wei Le

Programming Language Confusion: When Code LLMs Can't Keep their Languages Straight

Large Language Models (LLMs) have achieved state-of-the-art performance across software engineering tasks, from code generation to translation. However, we identify and systematically evaluate a critical failure mode: Programming Language…

Software Engineering · Computer Science 2026-02-03 Micheline Bénédicte Moumoula, Serge Lionel Nikiema, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyande

Improving the Robustness of Large Language Models via Consistency Alignment

Large language models (LLMs) have shown tremendous success in following user instructions and generating helpful responses. Nevertheless, their robustness is still far from optimal, as they may generate significantly inconsistent responses…

Computation and Language · Computer Science 2024-03-25 Yukun Zhao, Lingyong Yan, Weiwei Sun, Guoliang Xing, Shuaiqiang Wang, Chong Meng, Zhicong Cheng, Zhaochun Ren, Dawei Yin

Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study

Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity. Yet, their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs. While…

Software Engineering · Computer Science 2024-02-12 Yufei Li, Simin Chen, Yanghong Guo, Wei Yang, Yue Dong, Cong Liu

"I May Not Have Articulated Myself Clearly": Diagnosing Dynamic Instability in LLM Reasoning at Inference Time

Reasoning failures in large language models (LLMs) are typically measured only at the end of a generation, yet many failures manifest as a process-level breakdown: the model "loses the thread" mid-reasoning. We study whether such breakdowns…

Artificial Intelligence · Computer Science 2026-02-04 Jinkun Chen, Fengxiang Cheng, Sijia Han, Vlado Keselj

Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data

Context: In the fast-paced evolution of software development, Large Language Models (LLMs) have become indispensable tools for tasks such as code generation, completion, analysis, and bug fixing. Ensuring the robustness of these models…

Software Engineering · Computer Science 2026-02-13 Yang Liu, Armstrong Foundjem, Xingfang Wu, Heng Li, Foutse Khomh

Memory Consistency Models using Constraints

Memory consistency models (MCMs) are at the heart of concurrent programming. They represent the behaviour of concurrent programs at the chip level. To test these models, small program snippets called litmus tests are generated, which show…

Programming Languages · Computer Science 2018-08-30 Ruth Hoffmann, Özgür Akgün, Susmit Sarkar

Certainty robustness: Evaluating LLM stability under self-challenging prompts

Large language models (LLMs) often present answers with high apparent confidence despite lacking an explicit mechanism for reasoning about certainty or truth. While existing benchmarks primarily evaluate single-turn accuracy, truthfulness…

Computation and Language · Computer Science 2026-03-05 Mohammadreza Saadat, Steve Nemzer

Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation

Understanding a program's runtime reasoning behavior, meaning how intermediate states and control flows lead to final execution results, is essential for reliable code generation, debugging, and automated reasoning. Although large language…

Software Engineering · Computer Science 2025-12-02 Mohammad Abdollahi, Khandaker Rifah Tasnia, Soumit Kanti Saha, Jinqiu Yang, Song Wang, Hadi Hemmati

Are Your LLMs Capable of Stable Reasoning?

The rapid advancement of large language models (LLMs) has shown remarkable progress in complex reasoning tasks. However, a significant disparity exists between benchmark performances and real-world applications. We attribute this gap…

Artificial Intelligence · Computer Science 2025-08-11 Junnan Liu, Hongwei Liu, Linchen Xiao, Ziyi Wang, Kuikun Liu, Songyang Gao, Wenwei Zhang, Songyang Zhang, Kai Chen

Calibration, Entropy Rates, and Memory in Language Models

Building accurate language models that capture meaningful long-term dependencies is a core challenge in natural language processing. Towards this end, we present a calibration-based approach to measure long-term discrepancies between a…

Computation and Language · Computer Science 2019-06-14 Mark Braverman, Xinyi Chen, Sham M. Kakade, Karthik Narasimhan, Cyril Zhang, Yi Zhang