Related papers: Robust Reasoning Benchmark

200 papers

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science 2024-06-18 Yuqing Wang, Yun Zhao

The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce…

Computation and Language · Computer Science 2026-01-22 Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi

Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, "frozen world" settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the…

Computation and Language · Computer Science 2025-10-17 Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

This study investigates the reasoning robustness of large language models (LLMs) on mathematical problem-solving tasks under systematically introduced input perturbations. Using the GSM8K dataset as a controlled testbed, we evaluate how…

Artificial Intelligence · Computer Science 2025-04-04 Giannis Chatziveroglou, Richard Yun, Maura Kelleher

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate…

Computation and Language · Computer Science 2026-04-16 Md. Fahad Ullah Utsho, Mohd. Ruhul Ameen, Akif Islam, Md. Golam Rashed, Dipankar Das

LLMs demonstrate remarkable reasoning capabilities, yet whether they utilize internal world models or rely on sophisticated pattern matching remains open. We study LLMs through the lens of robustness of their code understanding using a…

Software Engineering · Computer Science 2026-04-21 Claudio Spiess, Prem Devanbu, Earl T. Barr

Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought…

Machine Learning · Computer Science 2025-07-02 Jhouben Cuesta-Ramirez, Samuel Beaussant, Mehdi Mounsif

Large language models (LLMs) increasingly assist software engineering tasks that require reasoning over long code contexts, yet their robustness under varying input conditions remains unclear. We conduct a systematic study of long-context…

Software Engineering · Computer Science 2026-02-20 Kishan Maharaj, Nandakishore Menon, Ashita Saxena, Srikanth Tamilselvam

Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists: models frequently yield correct detection verdicts based on hallucinated logic or superficial…

Cryptography and Security · Computer Science 2026-02-09 Li Lu, Yanjie Zhao, Hongzhou Rao, Kechi Zhang, Haoyu Wang

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their…

Artificial Intelligence · Computer Science 2025-11-21 Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making…

Computation and Language · Computer Science 2025-06-26 Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang

Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under…

Artificial Intelligence · Computer Science 2026-03-13 Yubo Li, Ramayya Krishnan, Rema Padman

Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and…

Computation and Language · Computer Science 2025-06-13 Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg

Despite the recent success of large language models (LLMs) in reasoning such as DeepSeek, we for the first time identify a key dilemma in reasoning robustness and generalization: significant performance degradation on novel or incomplete…

Artificial Intelligence · Computer Science 2025-03-07 Tong Yu, Yongcheng Jing, Xikun Zhang, Wentao Jiang, Wenjie Wu, Yingjie Wang, Wenbin Hu, Bo Du, Dacheng Tao

Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-22 Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen

Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with…

Artificial Intelligence · Computer Science 2025-10-14 Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu

Recent advancements in Large Reasoning Models (LRMs), such as OpenAI's o1/o3 and DeepSeek-R1, have demonstrated remarkable performance in specialized reasoning tasks through human-like deliberative thinking and long chain-of-thought…

Artificial Intelligence · Computer Science 2025-11-20 Weixiang Zhao, Xingyu Sui, Jiahe Guo, Yulin Hu, Yang Deng, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Hao He, Wanxiang Che, Ting Liu, Bing Qin

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and…

Artificial Intelligence · Computer Science 2025-09-30 Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

The scaling of Large Language Models (LLMs) has exposed a critical gap between their performance on static benchmarks and their fragility in dynamic, information-rich environments. While models excel at isolated tasks, the computational…

Artificial Intelligence · Computer Science 2025-09-29 Sai Teja Reddy Adapala

Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on…

Machine Learning · Computer Science 2025-10-08 Andreas Hochlehnert, Hardik Bhatnagar, Vishaal Udandarao, Samuel Albanie, Ameya Prabhu, Matthias Bethge