Related papers: How Robustly do LLMs Understand Execution Semantic…

Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models

LLMs have made significant progress in the field of mathematical reasoning, but whether they have true mathematical understanding ability is still controversial. To explore this issue, we propose a new perturbation framework to evaluate…

Artificial Intelligence · Computer Science · 2025-11-12 · Zhishen Sun, Guang Dai, Ivor Tsang, Haishan Ye

Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code

Large language models (LLMs) are being increasingly adopted in the software engineering domain, yet the robustness of their grasp on core software design concepts remains unclear. We conduct an empirical study to systematically evaluate…

Software Engineering · Computer Science · 2025-12-30 · Mootez Saad, Boqi Chen, José Antonio Hernández López, Dániel Varró, Tushar Sharma

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science · 2024-06-18 · Yuqing Wang, Yun Zhao

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and…

Computation and Language · Computer Science · 2025-06-13 · Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their…

Artificial Intelligence · Computer Science · 2025-11-21 · Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Exploring LLM Reasoning Through Controlled Prompt Variations

This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how…

Artificial Intelligence · Computer Science · 2025-04-04 · Giannis Chatziveroglou, Richard Yun, Maura Kelleher

Are Large Language Models Robust in Understanding Code Against Semantics-Preserving Mutations?

With the widespread adoption of vibe coding, understanding the reasoning and robustness of Large Language Models (LLMs) is critical for their reliable use in programming tasks. While recent studies assess LLMs' ability to predict program…

Software Engineering · Computer Science · 2026-05-08 · Pedro Orvalho, Marta Kwiatkowska

Robust Reasoning Benchmark

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of…

Machine Learning · Computer Science · 2026-05-22 · Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions

Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness…

Computation and Language · Computer Science · 2024-11-05 · Pengfei Hong, Navonil Majumder, Deepanway Ghosal, Somak Aditya, Rada Mihalcea, Soujanya Poria

To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5%…

Software Engineering · Computer Science · 2025-01-09 · Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T. Barr, Wei Le

Improving the Robustness of Large Language Models for Code Tasks via Fine-tuning with Perturbed Data

Context: In the fast-paced evolution of software development, Large Language Models (LLMs) have become indispensable tools for tasks such as code generation, completion, analysis, and bug fixing. Ensuring the robustness of these models…

Software Engineering · Computer Science · 2026-02-13 · Yang Liu, Armstrong Foundjem, Xingfang Wu, Heng Li, Foutse Khomh

Patterns Over Principles: The Fragility of Inductive Reasoning in LLMs under Noisy Observations

Inductive reasoning, a cornerstone of human cognition, enables generalization from limited data but hasn't yet been fully achieved by large language models (LLMs). While modern LLMs excel at reasoning tasks, their ability to maintain stable…

Artificial Intelligence · Computer Science · 2025-05-29 · Chunyang Li, Weiqi Wang, Tianshi Zheng, Yangqiu Song

Demystifying Errors in LLM Reasoning Traces: An Empirical Study of Code Execution Simulation

Understanding a program's runtime reasoning behavior, meaning how intermediate states and control flows lead to final execution results, is essential for reliable code generation, debugging, and automated reasoning. Although large language…

Software Engineering · Computer Science · 2025-12-02 · Mohammad Abdollahi, Khandaker Rifah Tasnia, Soumit Kanti Saha, Jinqiu Yang, Song Wang, Hadi Hemmati

Large Language Models are Algorithmically Blind

Large language models (LLMs) demonstrate remarkable breadth of knowledge, yet their ability to reason about computational processes remains poorly understood. Closing this gap matters for practitioners who rely on LLMs to guide algorithm…

Computation and Language · Computer Science · 2026-04-07 · Sohan Venkatesh, Ashish Mahendran Kurapath, Tejas Melkote

Assessing Coherency and Consistency of Code Execution Reasoning by Large Language Models

This paper proposes CES, a task to evaluate the abilities of LLMs in simulating program execution and using that reasoning in programming tasks. Besides measuring the correctness of variable predictions during execution simulation, CES…

Software Engineering · Computer Science · 2026-04-08 · Changshu Liu, Yang Chen, Reyhaneh Jabbarvand

Reasoning Models Reason Well, Until They Don't

Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings…

Artificial Intelligence · Computer Science · 2025-10-28 · Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, Abulhair Saparov

Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning

While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and…

Computation and Language · Computer Science · 2025-10-14 · Jialu Du, Guiyang Hou, Yihui Fu, Chen Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu

Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets

Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations, causing concerns about trust. To enhance trust, it is imperative to gain a comprehensive understanding of the…

Computation and Language · Computer Science · 2025-01-03 · Vatsal Gupta, Pranshu Pandya, Tushar Kataria, Vivek Gupta, Dan Roth

CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with…

Artificial Intelligence · Computer Science · 2025-10-14 · Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu

RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning

Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in English natural language. While the progress is promising, it is currently unclear if these models…

Computation and Language · Computer Science · 2022-11-09 · Soumya Sanyal, Zeyi Liao, Xiang Ren