Related papers: DiffSpec: Differential Testing with LLMs using Nat…

Hidden Reliability Risks in Large Language Models: Systematic Identification of Precision-Induced Output Disagreements

Large language models (LLMs) are increasingly deployed under diverse numerical precision configurations, including standard floating-point formats (e.g., bfloat16 and float16) and quantized integer formats (e.g., int16 and int8), to meet…

Artificial Intelligence · Computer Science 2026-04-23 Yifei Wang, Tianlin Li, Xiaohan Zhang, Xiaoyu Zhang, Wei Ma, Mingfei Cheng, Li Pan

Enhancing Differential Testing With LLMs For Testing Deep Learning Libraries

Differential testing offers a promising strategy to alleviate the test oracle problem by comparing the test results between alternative implementations. However, existing differential testing techniques for deep learning (DL) libraries are…

Software Engineering · Computer Science 2025-05-09 Meiziniu Li, Dongze Li, Jianmeng Liu, Jialun Cao, Yongqiang Tian, Shing-Chi Cheung

Finding Missed Code Size Optimizations in Compilers using LLMs

Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production…

Software Engineering · Computer Science 2025-01-03 Davide Italiano, Chris Cummins

Evaluating Language Models for Efficient Code Generation

We introduce Differential Performance Evaluation (DPE), a framework designed to reliably evaluate Large Language Models (LLMs) for efficient code generation. Traditional coding benchmarks often fail to provide reliable insights into code…

Software Engineering · Computer Science 2024-08-14 Jiawei Liu, Songrun Xie, Junhao Wang, Yuxiang Wei, Yifeng Ding, Lingming Zhang

Sharpen the Spec, Cut the Code: A Case for Generative File System with SYSSPEC

File systems are critical OS components that require constant evolution to support new hardware and emerging application needs. However, the traditional paradigm of developing features, fixing bugs, and maintaining the system incurs…

Operating Systems · Computer Science 2026-02-11 Qingyuan Liu, Mo Zou, Hengbin Zhang, Dong Du, Yubin Xia, Haibo Chen

AutoReSpec: A Framework for Generating Specification using Large Language Models

Formal specification generation has recently drawn attention in software engineering as a way to improve program correctness without requiring manual annotations. Large Language Models (LLMs) have shown promise in this area, but early…

Software Engineering · Computer Science 2026-04-07 Ragib Shahariar Ayon, Shibbir Ahmed

The Rarity Blind Spot: A Framework for Evaluating Statistical Reasoning in LLMs

Effective decision-making often relies on identifying what makes each candidate distinctive. While existing benchmarks for LLMs emphasize retrieving or summarizing information relevant to a given query, they do not evaluate a model's…

Computation and Language · Computer Science 2025-10-02 Seiji Maekawa, Hayate Iso, Nikita Bhutani

DocPrism: Local Categorization and External Filtering to Identify Relevant Code-Documentation Inconsistencies

Code-documentation inconsistencies are common and undesirable: they can lead to developer misunderstandings and software defects. This paper introduces DocPrism, a multi-language, code-documentation inconsistency detection tool. DocPrism…

Software Engineering · Computer Science 2025-11-04 Xiaomeng Xu, Zahin Wahab, Reid Holmes, Caroline Lemieux

Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions

Large language models (LLMs), such as OpenAI's Codex, have demonstrated their potential to generate code from natural language descriptions across a wide range of programming tasks. Several benchmarks have recently emerged to evaluate the…

Software Engineering · Computer Science 2023-04-11 Sarah Fakhoury, Saikat Chakraborty, Madan Musuvathi, Shuvendu K. Lahiri

DISTINCT: A Description-Guided Branch-Consistency Analysis Framework for Non-Regressive Test Case Generation

Automated test-generation research overwhelmingly assumes the correctness of focal methods, yet practitioners routinely face non-regression scenarios where the focal method may be defective. A baseline evaluation of EVOSUITE and two leading…

Software Engineering · Computer Science 2026-02-03 Pengyu Xue, Yuxiang Zhang, Zhen Yang, Xiaoxue Ren, Xiang Li, Pengfei Hu, Linhao Wu, Kainan Li

PromptPex: Automatic Test Generation for Language Model Prompts

Large language models (LLMs) are being used in many applications and prompts for these models are integrated into software applications as code-like artifacts. These prompts behave much like traditional software in that they take inputs,…

Software Engineering · Computer Science 2026-02-09 Reshabh K Sharma, Jonathan De Halleux, Shraddha Barke, Dan Grossman, Benjamin Zorn

Understanding Defects in Generated Codes by Language Models

This study investigates the reliability of code generation by Large Language Models (LLMs), focusing on identifying and analyzing defects in the generated code. Despite the advanced capabilities of LLMs in automating code generation,…

Software Engineering · Computer Science 2024-08-27 Ali Mohammadi Esfahani, Nafiseh Kahani, Samuel A. Ajila

Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction

Bug reproduction is a critical developer activity that is also challenging to automate, as bug reports are often in natural language and thus can be difficult to transform to test cases consistently. As a result, existing techniques mostly…

Software Engineering · Computer Science 2023-11-10 Sungmin Kang, Juyeon Yoon, Nargiz Askarbekkyzy, Shin Yoo

An Evalutation of Programming Language Models' performance on Software Defect Detection

This dissertation presents an evaluation of several language models on software defect datasets. A language model (LM) "can provide word representation and probability indication of word sequences as the core component of an NLP system."…

Software Engineering · Computer Science 2019-09-24 Kailun Wang

A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings

With the rapid adoption of large language models (LLMs) in automated code refactoring, assessing and ensuring functional equivalence between LLM-generated refactoring and the original implementation becomes critical. While prior work…

Software Engineering · Computer Science 2026-02-18 Simantika Bhattacharjee Dristi, Matthew B. Dwyer

Large Language Models are Edge-Case Fuzzers: Testing Deep Learning Libraries via FuzzGPT

Deep Learning (DL) library bugs affect downstream DL applications, emphasizing the need for reliable systems. Generating valid input programs for fuzzing DL libraries is challenging due to the need for satisfying both language…

Software Engineering · Computer Science 2023-04-05 Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, Lingming Zhang

Language Models can Evaluate Themselves via Probability Discrepancy

In this paper, we initiate our discussion by demonstrating how Large Language Models (LLMs), when tasked with responding to queries, display a more even probability distribution in their answers if they are more adept, as opposed to their…

Computation and Language · Computer Science 2024-07-10 Tingyu Xia, Bowen Yu, Yuan Wu, Yi Chang, Chang Zhou

Natural Language based Specification and Verification

Recent frontier large language models (LLMs) have shown strong performance in identifying security vulnerabilities in large, mature open-source systems. As LLM-generated code becomes increasingly common, a natural goal is to prevent such…

Software Engineering · Computer Science 2026-05-13 Zhaorui Li, Chengyu Song

Localized Calibrated Uncertainty in Code Language Models

Large language models (LLMs) can generate complicated source code from natural language prompts. However, LLMs can generate output that deviates from what the user wants, requiring supervision and editing. To support this process, we offer…

Software Engineering · Computer Science 2026-01-01 David Gros, Prem Devanbu

Can LLMs Patch Security Issues?

Large Language Models (LLMs) have shown impressive proficiency in code generation. Unfortunately, these models share a weakness with their human counterparts: producing code that inadvertently has security vulnerabilities. These…

Cryptography and Security · Computer Science 2024-10-17 Kamel Alrashedy, Abdullah Aljasser, Pradyumna Tambwekar, Matthew Gombolay