Related papers: LLM-based HSE Compliance Assessment: Benchmark, Pe…

Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation

Recent advances in large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. However, whether LLMs possess genuine fluid intelligence (i.e., the ability to reason abstractly and…

Artificial Intelligence · Computer Science · 2025-09-30 · Yue Yang, MingKang Chen, Qihua Liu, Mengkang Hu, Qiguang Chen, Gengrui Zhang, Shuyue Hu, Guangtao Zhai, Yu Qiao, Yu Wang, Wenqi Shao, Ping Luo

Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases

Recent advancements in reasoning-enhanced large language models (LLMs), such as DeepSeek-R1 and OpenAI-o3, have demonstrated significant progress. However, their application in professional medical contexts remains underexplored,…

Computation and Language · Computer Science · 2025-03-11 · Pengcheng Qiu, Chaoyi Wu, Shuyu Liu, Weike Zhao, Zhuoxia Chen, Hongfei Gu, Chuanjin Peng, Ya Zhang, Yanfeng Wang, Weidi Xie

MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to…

Artificial Intelligence · Computer Science · 2025-05-28 · Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue

HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning

Structure reasoning is a fundamental capability of large language models (LLMs), enabling them to reason about structured commonsense and answer multi-hop questions. However, existing benchmarks for structure reasoning mainly focus on…

Computation and Language · Computer Science · 2025-03-04 · Zhuohang Jiang, Pangjing Wu, Ziran Liang, Peter Q. Chen, Xu Yuan, Ye Jia, Jiancheng Tu, Chen Li, Peter H. F. Ng, Qing Li

RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational…

Artificial Intelligence · Computer Science · 2025-10-13 · Hyundong Jin, Joonghyuk Hahn, Yo-Sub Han

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models

The advent of large language models (LLMs) and their adoption by the legal community has given rise to the question: what types of legal reasoning can LLMs perform? To enable greater study of this question, we present LegalBench: a…

Computation and Language · Computer Science · 2023-08-23 · Neel Guha, Julian Nyarko, Daniel E. Ho, Christopher Ré, Adam Chilton, Aditya Narayana, Alex Chohlas-Wood, Austin Peters, Brandon Waldon, Daniel N. Rockmore, Diego Zambrano, Dmitry Talisman, Enam Hoque, Faiz Surani, Frank Fagan, Galit Sarfaty, Gregory M. Dickinson, Haggai Porat, Jason Hegland, Jessica Wu, Joe Nudell, Joel Niklaus, John Nay, Jonathan H. Choi, Kevin Tobia, Margaret Hagan, Megan Ma, Michael Livermore, Nikon Rasumov-Rahe, Nils Holzenberger, Noam Kolt, Peter Henderson, Sean Rehaag, Sharad Goel, Shang Gao, Spencer Williams, Sunny Gandhi, Tom Zur, Varun Iyer, Zehua Li

Mechanics of Learned Reasoning 1: TempoBench, A Benchmark for Interpretable Deconstruction of Reasoning System Performance

Large Language Models (LLMs) are increasingly excelling and outpacing human performance on many tasks. However, to improve LLM reasoning, researchers either rely on ad-hoc generated datasets or formal mathematical proof systems such as the…

Artificial Intelligence · Computer Science · 2025-11-03 · Nikolaus Holzer, William Fishell, Baishakhi Ray, Mark Santolucito

seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs

We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation…

Artificial Intelligence · Computer Science · 2025-09-23 · Mohammad Ramezanali, Mo Vazifeh, Paolo Santi

From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information?

While existing benchmarks probe the reasoning abilities of large language models (LLMs) across diverse domains, they predominantly assess passive reasoning, providing models with all the information needed to reach a solution. By contrast,…

Machine Learning · Computer Science · 2025-06-11 · Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, Bo Han

Medical Reasoning with Large Language Models: A Survey and MR-Bench

Large language models (LLMs) have achieved strong performance on medical exam-style tasks, motivating growing interest in their deployment in real-world clinical settings. However, clinical decision-making is inherently safety-critical,…

Computation and Language · Computer Science · 2026-04-13 · Xiaohan Ren, Chenxiao Fan, Wenyin Ma, Hongliang He, Chongming Gao, Xiaoyan Zhao, Fuli Feng

ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room

Existing benchmarks for evaluating the clinical reasoning capabilities of large language models (LLMs) often lack a clear definition of "clinical reasoning" as a construct, fail to capture the full breadth of interdependent tasks within a…

Computation and Language · Computer Science · 2026-05-12 · Nikita Mehandru, Niloufar Golchini, Namrata Garg, Kathy T. LeSaint, Christopher J. Nash, Anu Ramachandran, Travis Zack, Liam G. McCoy, Adam Rodman, David Bamman, Melanie Molina, Ahmed Alaa

Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have…

Artificial Intelligence · Computer Science · 2025-09-17 · Marylou Fauchard, Florian Carichon, Margarida Carvalho, Golnoosh Farnadi

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Large language models (LLMs) are increasingly deployed in settings where reasoning, such as multi-step problem solving and chain-of-thought, is essential. Yet, current evaluation practices overwhelmingly report single-run accuracy while…

Artificial Intelligence · Computer Science · 2025-12-09 · Nearchos Potamitis, Lars Klein, Akhil Arora

LLMs for Relational Reasoning: How Far are We?

Large language models (LLMs) have revolutionized many areas (e.g. natural language processing, software engineering, etc.) by achieving state-of-the-art performance on extensive downstream tasks. Aiming to achieve robust and general…

Artificial Intelligence · Computer Science · 2024-01-18 · Zhiming Li, Yushi Cao, Xiufeng Xu, Junzhe Jiang, Xu Liu, Yon Shin Teo, Shang-wei Lin, Yang Liu

LongReason: A Synthetic Long-Context Reasoning Benchmark via Context Expansion

Large language models (LLMs) have demonstrated remarkable progress in understanding long-context inputs. However, benchmarks for evaluating the long-context reasoning abilities of LLMs fall behind the pace. Existing benchmarks often focus…

Computation and Language · Computer Science · 2025-11-19 · Zhan Ling, Kang Liu, Kai Yan, Yifan Yang, Weijian Lin, Ting-Han Fan, Lingfeng Shen, Zhengyin Du, Jiecao Chen

CLR-Bench: Evaluating Large Language Models in College-level Reasoning

Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks. While emerging benchmarks have been proposed to evaluate LLMs in various domains such as mathematics and computer…

Artificial Intelligence · Computer Science · 2024-10-28 · Junnan Dong, Zijin Hong, Yuanchen Bei, Feiran Huang, Xinrun Wang, Xiao Huang

IOLBENCH: Benchmarking LLMs on Linguistic Reasoning

Despite the remarkable advancements and widespread applications of deep neural networks, their ability to perform reasoning tasks remains limited, particularly in domains requiring structured, abstract thought. In this paper, we investigate…

Computation and Language · Computer Science · 2025-09-16 · Satyam Goyal, Soham Dan

LLM-Cave: A benchmark and light environment for large language models reasoning and decision-making system

Large language models (LLMs) such as ChatGPT o1, ChatGPT o3, and DeepSeek R1 have shown great potential in solving difficult problems. However, current LLM evaluation benchmarks are limited to one-step interactions. Some of the existing…

Machine Learning · Computer Science · 2025-12-01 · Huanyu Li, Zongyuan Li, Wei Huang, Xian Guo

DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going…

Computation and Language · Computer Science · 2024-07-16 · Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

DNR Bench: Benchmarking Over-Reasoning in Reasoning LLMs

Test-time scaling has significantly improved large language model performance, enabling deeper reasoning to solve complex problems. However, this increased reasoning capability also leads to excessive token generation and unnecessary…

Machine Learning · Computer Science · 2025-04-21 · Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, Vikas Yadav