Related papers: Exposing Weaknesses of Large Reasoning Models thro…

Rethinking and Benchmarking Large Language Models for Graph Reasoning

Large Language Models (LLMs) for Graph Reasoning have been extensively studied over the past two years, involving enabling LLMs to understand graph structures and reason on graphs to solve various graph problems, with graph algorithm…

Artificial Intelligence · Computer Science 2025-10-03 Yuwei Hu, Xinyi Huang, Zhewei Wei, Yongchao Liu, Chuntao Hong

AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?

Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on several reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for…

Artificial Intelligence · Computer Science 2026-01-12 Henan Sun, Kaichi Yu, Yuyao Wang, Bowen Liu, Xunkai Li, Rong-Hua Li, Nuo Chen, Jia Li

Reasoning Models Reason Well, Until They Don't

Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings…

Artificial Intelligence · Computer Science 2025-10-28 Revanth Rameshkumar, Jimson Huang, Yunxin Sun, Fei Xia, Abulhair Saparov

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Recent generations of language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their…

Artificial Intelligence · Computer Science 2025-11-21 Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, Mehrdad Farajtabar

LR^2Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems

Recent progress in Large Reasoning Models (LRMs) has significantly enhanced the reasoning abilities of Large Language Models (LLMs), empowering them to tackle increasingly complex tasks through reflection capabilities, such as making…

Computation and Language · Computer Science 2025-06-26 Jianghao Chen, Zhenlin Wei, Zhenjiang Ren, Ziyong Li, Jiajun Zhang

GraphRAG-Bench: Challenging Domain-Specific Reasoning for Evaluating Graph Retrieval-Augmented Generation

Graph Retrieval Augmented Generation (GraphRAG) has garnered increasing recognition for its potential to enhance large language models (LLMs) by structurally organizing domain-specific corpora and facilitating complex reasoning. However,…

Computation and Language · Computer Science 2025-06-23 Yilin Xiao, Junnan Dong, Chuang Zhou, Su Dong, Qian-wen Zhang, Di Yin, Xing Sun, Xiao Huang

GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs

Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains…

Artificial Intelligence · Computer Science 2026-02-11 Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, Yong Wu

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for…

Machine Learning · Computer Science 2026-02-12 Yu He, Yingxi Li, Colin White, Ellen Vitercik

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science 2024-06-18 Yuqing Wang, Yun Zhao

IOLBENCH: Benchmarking LLMs on Linguistic Reasoning

Despite the remarkable advancements and widespread applications of deep neural networks, their ability to perform reasoning tasks remains limited, particularly in domains requiring structured, abstract thought. In this paper, we investigate…

Computation and Language · Computer Science 2025-09-16 Satyam Goyal, Soham Dan

GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models

Evaluating the graph comprehension and reasoning abilities of Large Language Models (LLMs) is challenging and often incomplete. Existing benchmarks focus primarily on pure graph understanding, lacking a comprehensive evaluation across all…

Artificial Intelligence · Computer Science 2025-02-27 Zike Yuan, Ming Liu, Hui Wang, Bing Qin

LongReasonArena: A Long Reasoning Benchmark for Large Language Models

Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark…

Computation and Language · Computer Science 2025-08-28 Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei

A Graph Talks, But Who's Listening? Rethinking Evaluations for Graph-Language Models

Developments in Graph-Language Models (GLMs) aim to integrate the structural reasoning capabilities of Graph Neural Networks (GNNs) with the semantic understanding of Large Language Models (LLMs). However, we demonstrate that current…

Computation and Language · Computer Science 2025-08-29 Soham Petkar, Hari Aakash K, Anirudh Vempati, Akshit Sinha, Ponnurangam Kumaraguru, Chirag Agarwal

Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate…

Computation and Language · Computer Science 2026-04-16 Md. Fahad Ullah Utsho, Mohd. Ruhul Ameen, Akif Islam, Md. Golam Rashed, Dipankar Das

GraphReason: Enhancing Reasoning Capabilities of Large Language Models through A Graph-Based Verification Approach

Large Language Models (LLMs) have showcased impressive reasoning capabilities, particularly when guided by specifically designed prompts in complex reasoning tasks such as math word problems. These models typically solve tasks using a…

Artificial Intelligence · Computer Science 2024-04-23 Lang Cao

GraphLLM: Boosting Graph Reasoning Ability of Large Language Model

The advancement of Large Language Models (LLMs) has remarkably pushed the boundaries towards artificial general intelligence (AGI), with their exceptional ability on understanding diverse types of information, including but not limited to…

Computation and Language · Computer Science 2023-10-10 Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaohai Hu, Xuanwen Huang, Yang Yang

THINK-Bench: Evaluating Thinking Efficiency and Chain-of-Thought Quality of Large Reasoning Models

Large reasoning models (LRMs) have achieved impressive performance in complex tasks, often outperforming conventional large language models (LLMs). However, the prevalent issue of overthinking severely limits their computational efficiency.…

Computation and Language · Computer Science 2025-05-29 Zhiyuan Li, Yi Chang, Yuan Wu

Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic…

Computation and Language · Computer Science 2025-06-04 Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, Lei Chen

LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs fall behind the pace. Existing benchmarks often focus…

Computation and Language · Computer Science 2025-11-19 Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models

Currently, process reward models (PRMs) have exhibited remarkable potential for test-time scaling. Since large language models (LLMs) regularly generate flawed intermediate reasoning steps when tackling a broad spectrum of reasoning and…

Artificial Intelligence · Computer Science 2026-05-08 Zhouhao Sun, Xuan Zhang, Xiao Ding, Bibo Cai, Li Du, Kai Xiong, Xinran Dai, Fei Zhang, weidi tang, Zhiyuan Kan, Yang Zhao, Bing Qin, Ting Liu