Related papers: Dynamic Stability of LLM-Generated Code

Correctness isn't Efficiency: Runtime Memory Divergence in LLM-Generated Code

Large language models (LLMs) can generate programs that pass unit tests, but passing tests does not guarantee reliable runtime behavior. We find that different correct solutions to the same task can show very different memory and…

Software Engineering · Computer Science 2026-02-03 Prateek Rajput, Yewei Song, Abdoul Aziz Bonkoungou, Iyiola E. Olatunji, Abdoul Kader Kabore, Jacques Klein, Tegawendé F. Bissyandé

Measuring LLM Code Generation Stability via Structural Entropy

Assessing the stability of code generation from large language models (LLMs) is essential for judging their reliability in real-world development. We extend prior "structural-entropy concepts" to the program domain by pairing entropy with…

Software Engineering · Computer Science 2025-08-21 Yewei Song, Tiezhu Sun, Xunzhu Tang, Prateek Rajput, Tegawende F. Bissyande, Jacques Klein

Analyzing the Instability of Large Language Models in Automated Bug Injection and Correction

The use of Large Language Models (LLMs) in software engineering tasks is growing, especially in the areas of bug fixing and code generation. Nevertheless, these models often yield unstable results; when executed at different times with the…

Software Engineering · Computer Science 2025-09-09 Mehmet Bilal Er, Nagehan İlhan, Umut Kuran

Using Semantic Distance to Estimate Uncertainty in LLM-Based Code Generation

LLMs show strong performance in code generation, but their outputs lack correctness guarantees. Sample-based uncertainty estimators address this by generating multiple candidate programs and measuring their disagreement. However, existing…

Software Engineering · Computer Science 2026-05-12 Weilin He, Arindam Sharma, Cristina David

Static Analysis as a Feedback Loop: Enhancing LLM-Generated Code Beyond Correctness

Large language models (LLMs) have demonstrated impressive capabilities in code generation, achieving high scores on benchmarks such as HumanEval and MBPP. However, these benchmarks primarily assess functional correctness and neglect broader…

Software Engineering · Computer Science 2025-08-21 Scott Blyth, Sherlock A. Licorish, Christoph Treude, Markus Wagner

Stability as a Liability: Systematic Breakdown of Linguistic Structure in LLMs

Training stability is typically regarded as a prerequisite for reliable optimization in large language models. In this work, we analyze how stabilizing training dynamics affects the induced generation distribution. We show that under…

Artificial Intelligence · Computer Science 2026-02-10 Xianzhe Meng, Qiangsheng Zeng, Ling Luo, Qinghan Yang, Jiarui Hao, Wenbo Wu, Qinyu Wang, Rui Yin, Lin Qi, Renzhi Lu

Scalable and Efficient Continual Learning from Demonstration via a Hypernetwork-generated Stable Dynamics Model

Robots capable of learning from demonstration (LfD) must exhibit stability while executing learned motion skills. To be effective in the real world, they should also remember multiple skills over time -- a capability lacking in current…

Robotics · Computer Science 2026-05-12 Sayantan Auddy, Jakob Hollenstein, Matteo Saveriano, Antonio Rodríguez-Sánchez, Justus Piater

When Stability Fails: Hidden Failure Modes Of LLMs in Data-Constrained Scientific Decision-Making

Large language models (LLMs) are increasingly used as decision-support tools in data-constrained scientific workflows, where correctness and validity are critical. However, evaluation practices often emphasize stability or reproducibility…

Machine Learning · Computer Science 2026-03-18 Nazia Riasat

STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in…

Computation and Language · Computer Science 2026-01-01 Guanghui Wang, Jinze Yu, Xing Zhang, Dayuan Jiang, Yin Song, Tomal Deb, Xuefeng Liu, Peiyang He

LLM-Based Static Verification of Code Against Natural-Language Requirements: An Industrial Experience Report

Large language models (LLMs) are increasingly used to generate requirements specifications, design documents, code, and test cases. In contrast, much less attention has been given to a more difficult assurance problem: statically verifying…

Software Engineering · Computer Science 2026-05-19 Zhi Quan Zhou, Dave Towey, Tsong Yueh Chen

Helping LLMs Improve Code Generation Using Feedback from Testing and Static Analysis

Large Language Models (LLMs) are one of the most promising developments in the field of artificial intelligence, and the software engineering community has readily noticed their potential role in the software development life-cycle.…

Software Engineering · Computer Science 2026-03-16 Greta Dolcetti, Vincenzo Arceri, Eleonora Iotti, Sergio Maffeis, Agostino Cortesi, Enea Zaffanella

A Stochastic Differential Equation Framework for Multi-Objective LLM Interactions: Dynamical Systems Analysis with Code Generation Applications

We introduce a general stochastic differential equation framework for modelling multiobjective optimization dynamics in iterative Large Language Model (LLM) interactions. Our framework captures the inherent stochasticity of LLM responses…

Machine Learning · Computer Science 2025-10-14 Shivani Shukla, Himanshu Joshi

Rethinking the Evaluation of Secure Code Generation

Large language models (LLMs) are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current…

Cryptography and Security · Computer Science 2025-11-14 Shih-Chieh Dai, Jun Xu, Guanhong Tao

Stability-Weighted Decoding for Diffusion Language Models

Diffusion large language models (dLLMs) enable parallel text generation by iteratively denoising a fully masked sequence, unmasking a subset of masked tokens at each step. Existing decoding strategies rely on static confidence metrics…

Computation and Language · Computer Science 2026-04-21 Yue Wu, Jian Huang

Is Functional Correctness Enough to Evaluate Code Language Models? Exploring Diversity of Generated Codes

Language models (LMs) have exhibited impressive abilities in generating codes from natural language requirements. In this work, we highlight the diversity of code generated by LMs as a critical criterion for evaluating their code generation…

Software Engineering · Computer Science 2024-08-28 Heejae Chon, Seonghyeon Lee, Jinyoung Yeo, Dongha Lee

Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations

Code generation models are widely used in software development, yet their sensitivity to prompt phrasing remains under-examined. Identical requirements expressed with different emotions or communication styles can yield divergent outputs,…

Software Engineering · Computer Science 2025-09-18 Wei Ma, Yixiao Yang, Jingquan Ge, Xiaofei Xie, Lingxiao Jiang

LLM-ML Teaming: Integrated Symbolic Decoding and Gradient Search for Valid and Stable Generative Feature Transformation

Feature transformation enhances data representation by deriving new features from the original data. Generative AI offers potential for this task, but faces challenges in stable generation (consistent outputs) and valid generation…

Machine Learning · Computer Science 2025-06-12 Xinyuan Wang, Haoyue Bai, Nanxu Gong, Wangyang Ying, Sixun Dong, Xiquan Cui, Yanjie Fu

Evaluating Source Code Quality with Large Language Models: a comparative study

Code quality is an attribute composed of various metrics, such as complexity, readability, testability, interoperability, reusability, and the use of good or bad practices, among others. Static code analysis tools aim to measure a set of…

Software Engineering · Computer Science 2024-10-07 Igor Regis da Silva Simões, Elaine Venson

Statistical Multicriteria Evaluation of LLM-Generated Text

Assessing the quality of LLM-generated text remains a fundamental challenge in natural language processing. Current evaluation approaches often rely on isolated metrics or simplistic aggregations that fail to capture the nuanced trade-offs…

Computation and Language · Computer Science 2025-06-25 Esteban Garces Arias, Hannah Blocher, Julian Rodemann, Matthias Aßenmacher, Christoph Jansen

Assessing Large Language Models for Stabilizing Numerical Expressions in Scientific Software

Scientific software relies on high-precision computation, yet finite floating-point representations can introduce precision errors that propagate in safety-critical domains. Despite the growing use of large language models (LLMs) in…

Software Engineering · Computer Science 2026-04-10 Tien Nguyen, Kirshanthan Sundararajah, Muhammad Ali Gulzar