Related papers: Benchmarking Large Language Models with Integer Se…

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Language has long been conceived as an essential tool for human reasoning. The breakthrough of Large Language Models (LLMs) has sparked significant research interest in leveraging these models to tackle complex reasoning tasks. Researchers…

Artificial Intelligence · Computer Science · 2025-01-24 · Fengli Xu, Qianyue Hao, Zefang Zong, Jingwei Wang, Yunke Zhang, Jingyi Wang, Xiaochong Lan, Jiahui Gong, Tianjian Ouyang, Fanjin Meng, Chenyang Shao, Yuwei Yan, Qinglong Yang, Yiwen Song, Sijian Ren, Xinyuan Hu, Yu Li, Jie Feng, Chen Gao, Yong Li

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science · 2025-09-16 · Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

An Empirical Study of Reasoning Steps in Thinking Code LLMs

Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, the quality of these…

Artificial Intelligence · Computer Science · 2025-11-11 · Haoran Xue, Gias Uddin, Song Wang

Inference-Time Computations for LLM Reasoning and Planning: A Benchmark and Insights

We examine the reasoning and planning capabilities of large language models (LLMs) in solving complex tasks. Recent advances in inference-time techniques demonstrate the potential to enhance LLM reasoning without additional training by…

Artificial Intelligence · Computer Science · 2025-02-19 · Shubham Parashar, Blake Olson, Sambhav Khurana, Eric Li, Hongyi Ling, James Caverlee, Shuiwang Ji

Logical Reasoning in Large Language Models: A Survey

With the emergence of advanced reasoning models like OpenAI o3 and DeepSeek-R1, large language models (LLMs) have demonstrated remarkable reasoning capabilities. However, their ability to perform rigorous logical reasoning remains an open…

Artificial Intelligence · Computer Science · 2025-02-14 · Hanmeng Liu, Zhizhang Fu, Mengru Ding, Ruoxi Ning, Chaoli Zhang, Xiaozhang Liu, Yue Zhang

A Comparative Study on Reasoning Patterns of OpenAI's o1 Model

Enabling Large Language Models (LLMs) to handle a wider range of complex tasks (e.g., coding, math) has drawn great attention from many researchers. As LLMs continue to evolve, merely increasing the number of model parameters yields…

Computation and Language · Computer Science · 2024-10-24 · Siwei Wu, Zhongyuan Peng, Xinrun Du, Tuney Zheng, Minghao Liu, Jialong Wu, Jiachen Ma, Yizhi Li, Jian Yang, Wangchunshu Zhou, Qunshu Lin, Junbo Zhao, Zhaoxiang Zhang, Wenhao Huang, Ge Zhang, Chenghua Lin, J. H. Liu

Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems

Benchmarks are critical for measuring Large Language Model (LLM) reasoning capabilities. Some benchmarks have even become the de facto indicator of such capabilities. However, as LLM reasoning capabilities improve, existing widely-used…

Computation and Language · Computer Science · 2025-02-26 · Stephen Miner, Yoshiki Takashima, Simeng Han, Sam Kouteili, Ferhat Erata, Ruzica Piskac, Scott J Shapiro

Large Language Models and Mathematical Reasoning Failures

This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. Unlike prior studies that focus solely on answer correctness, we rigorously analyze…

Artificial Intelligence · Computer Science · 2025-02-24 · Johan Boye, Birger Moell

REL: Working out is all you need

Recent developments, particularly OpenAI's O1 model, have demonstrated the remarkable potential of Large Language Models (LLMs) for complex reasoning tasks. Through analysis of O1's outputs and provided sample Chain-of-Thought (CoT)…

Artificial Intelligence · Computer Science · 2024-12-09 · Toby Simonds, Jey Han Lau, Chaithanya Bandi

Learning to Reason: Training LLMs with GPT-OSS or DeepSeek R1 Reasoning Traces

Test-time scaling, which leverages additional computation during inference to improve model accuracy, has enabled a new class of Large Language Models (LLMs) that are able to reason through complex problems by understanding the goal,…

Computation and Language · Computer Science · 2025-11-25 · Shaltiel Shmidman, Asher Fredman, Oleg Sudakov, Meriem Bendris

When LLM Meets Time Series: Can LLMs Perform Multi-Step Time Series Reasoning and Inference

The rapid advancement of Large Language Models (LLMs) has sparked growing interest in their application to time series analysis tasks. However, their ability to perform complex reasoning over temporal data in real-world application domains…

Machine Learning · Computer Science · 2025-09-03 · Wen Ye, Jinbo Liu, Defu Cao, Wei Yang, Yan Liu

Beyond Chains of Thought: Benchmarking Latent-Space Reasoning Abilities in Large Language Models

Large language models (LLMs) can perform reasoning computations both internally within their latent space and externally by generating explicit token sequences like chains of thought. Significant progress in enhancing reasoning abilities…

Computation and Language · Computer Science · 2025-04-16 · Thilo Hagendorff, Sarah Fabi

DeepSeek-R1 vs. o3-mini: How Well can Reasoning LLMs Evaluate MT and Summarization?

Reasoning-enabled large language models (LLMs) excel in logical tasks, yet their utility for evaluating natural language generation remains unexplored. This study systematically compares reasoning LLMs with non-reasoning counterparts across…

Computation and Language · Computer Science · 2025-06-02 · Daniil Larionov, Sotaro Takeshita, Ran Zhang, Yanran Chen, Christoph Leiter, Zhipin Wang, Christian Greisinger, Steffen Eger

Thinking Machines: A Survey of LLM based Reasoning Strategies

Large Language Models (LLMs) are highly proficient in language-based tasks. Their language capabilities have positioned them at the forefront of the future AGI (Artificial General Intelligence) race. However, on closer inspection, Valmeekam…

Computation and Language · Computer Science · 2025-03-17 · Dibyanayan Bandyopadhyay, Soham Bhattacharjee, Asif Ekbal

Mathematical Computation and Reasoning Errors by Large Language Models

Large Language Models (LLMs) are increasingly utilized in AI-driven educational instruction and assessment, particularly within mathematics education. The capability of LLMs to generate accurate answers and detailed solutions for math…

Artificial Intelligence · Computer Science · 2025-08-15 · Liang Zhang, Edith Aurora Graf

Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present LLMThinkBench, a…

Computation and Language · Computer Science · 2026-04-24 · Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang

Exposing Numeracy Gaps: A Benchmark to Evaluate Fundamental Numerical Abilities in Large Language Models

Large Language Models (LLMs) have demonstrated impressive capabilities in natural language processing tasks, such as text generation and semantic understanding. However, their performance on numerical reasoning tasks, such as basic…

Computation and Language · Computer Science · 2025-06-04 · Haoyang Li, Xuejia Chen, Zhanchao XU, Darian Li, Nicole Hu, Fei Teng, Yiming Li, Luyu Qiu, Chen Jason Zhang, Qing Li, Lei Chen

OckBench: Measuring the Efficiency of LLM Reasoning

Large language models (LLMs) such as GPT-5 and Gemini 3 have pushed the frontier of automated reasoning and code generation. Yet current benchmarks emphasize accuracy and output quality, neglecting a critical dimension: efficiency of token…

Computation and Language · Computer Science · 2026-02-25 · Zheng Du, Hao Kang, Song Han, Tushar Krishna, Ligeng Zhu

Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models

Recent advancements in large language models (LLMs) have led to significant breakthroughs in mathematical reasoning capabilities. However, existing benchmarks like GSM8K or MATH are now being solved with high accuracy (e.g., OpenAI o1…

Computation and Language · Computer Science · 2024-12-25 · Bofei Gao, Feifan Song, Zhe Yang, Zefan Cai, Yibo Miao, Qingxiu Dong, Lei Li, Chenghao Ma, Liang Chen, Runxin Xu, Zhengyang Tang, Benyou Wang, Daoguang Zan, Shanghaoran Quan, Ge Zhang, Lei Sha, Yichang Zhang, Xuancheng Ren, Tianyu Liu, Baobao Chang

LLMs for Relational Reasoning: How Far are We?

Large language models (LLMs) have revolutionized many areas (e.g. natural language processing, software engineering, etc.) by achieving state-of-the-art performance on extensive downstream tasks. Aiming to achieve robust and general…

Artificial Intelligence · Computer Science · 2024-01-18 · Zhiming Li, Yushi Cao, Xiufeng Xu, Junzhe Jiang, Xu Liu, Yon Shin Teo, Shang-wei Lin, Yang Liu