Related papers: Robust Reasoning Benchmark

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science · 2024-06-18 · Yuqing Wang, Yun Zhao

How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce…

Computation and Language · Computer Science · 2026-01-22 · Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi

Are Large Reasoning Models Interruptible?

Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the…

Computation and Language · Computer Science · 2025-10-17 · Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Exploring LLM Reasoning Through Controlled Prompt Variations

This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how…

Artificial Intelligence · Computer Science · 2025-04-04 · Giannis Chatziveroglou, Richard Yun, Maura Kelleher

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate…

Computation and Language · Computer Science · 2026-04-16 · Md. Fahad Ullah Utsho, Mohd. Ruhul Ameen, Akif Islam, Md. Golam Rashed, Dipankar Das

How Robustly do LLMs Understand Execution Semantics?

LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a…

Software Engineering · Computer Science · 2026-04-21 · Claudio Spiess, Prem Devanbu, Earl T. Barr

Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories

Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought…

Machine Learning · Computer Science · 2025-07-02 · Jhouben Cuesta-Ramirez, Samuel Beaussant, Mehdi Mounsif

Robustness and Reasoning Fidelity of Large Language Models in Long-Context Code Question Answering

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context…

Software Engineering · Computer Science · 2026-02-20 · Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models

Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists: models frequently yield correct detection verdicts based on hallucinated logic or superficial…

Cryptography and Security · Computer Science · 2026-02-09 · Li Lu, Yanjie Zhao, Hongzhou Rao, Kechi Zhang, Haoyu Wang

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their…

Artificial Intelligence · Computer Science · 2025-11-21 · Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making…

Computation and Language · Computer Science · 2025-06-26 · Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang

Consistency of Large Reasoning Models Under Multi-Turn Attacks

Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under…

Artificial Intelligence · Computer Science · 2026-03-13 · Yubo Li, Ramayya Krishnan, Rema Padman

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and…

Computation and Language · Computer Science · 2025-06-13 · Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg

Benchmarking Reasoning Robustness in Large Language Models

Despite the recent success of large language models (LLMs) in reasoning such as DeepSeek, we for the first time identify a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete…

Artificial Intelligence · Computer Science · 2025-03-07 · Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao

Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-22 · Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen

CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with…

Artificial Intelligence · Computer Science · 2025-10-14 · Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought…

Artificial Intelligence · Computer Science · 2025-11-20 · Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Hao He, Wanxiang Che, Ting Liu, Bing Qin

Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and…

Artificial Intelligence · Computer Science · 2025-09-30 · Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

Cognitive Load Limits in Large Language Models: Benchmarking Multi-Hop Reasoning

The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational…

Artificial Intelligence · Computer Science · 2025-09-29 · Sai Teja Reddy Adapala

A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on…

Machine Learning · Computer Science · 2025-10-08 · Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge