Related papers: CodeGlance: Understanding Code Reasoning Challenge…

Exploring Code Analysis: Zero-Shot Insights on Syntax and Semantics with LLMs

Code analysis is fundamental in Software Engineering, supporting debugging, optimization, and security assessment. Human developers approach it through syntax parsing, static semantics inference, and dynamic reasoning. Traditional tools are…

Software Engineering · Computer Science 2026-05-22 Wei Ma, Zhihao Lin, Shangqing Liu, Qiang Hu, Ye Liu, Wenhan Wang, Cen Zhang, Liming Nie, Li Li, Yang Liu, Lingxiao Jiang

Coding Triangle: How Does Large Language Model Understand Code?

Large language models (LLMs) have achieved remarkable progress in code generation, yet their true programming competence remains underexplored. We introduce the Code Triangle framework, which systematically evaluates LLMs across three…

Computation and Language · Computer Science 2025-07-09 Taolin Zhang, Zihan Ma, Maosong Cao, Junnan Liu, Songyang Zhang, Kai Chen

Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills…

Computation and Language · Computer Science 2025-02-18 Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu

CodeMind: Evaluating Large Language Models for Code Reasoning

Large Language Models (LLMs) have been widely used to automate programming tasks. Their capabilities have been evaluated by assessing the quality of generated code through tests or proofs. The extent to which they can reason about code is a…

Software Engineering · Computer Science 2026-04-08 Changshu Liu, Yang Chen, Reyhaneh Jabbarvand

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science 2025-04-10 Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet, there is a dearth of studies on the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls,…

Software Engineering · Computer Science 2026-04-27 Changshu Liu, Alireza Ghazanfari, Yang Chen, Reyhaneh Jabbarvand

Reasoning Runtime Behavior of a Program with LLM: How Far Are We?

Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities. To evaluate the capabilities of code LLMs in various aspects, many benchmarks have been proposed (e.g., HumanEval and…

Software Engineering · Computer Science 2024-09-24 Junkai Chen, Zhiyuan Pan, Xing Hu, Zhenhao Li, Ge Li, Xin Xia

Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation

Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning, such as code generation, mathematical problem solving, and algorithmic synthesis -- especially when aided by reasoning tokens and…

Computation and Language · Computer Science 2025-06-13 Jaechul Roh, Varun Gandhi, Shivani Anilkumar, Arin Garg

An Empirical Study of Reasoning Steps in Thinking Code LLMs

Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, the quality of these…

Artificial Intelligence · Computer Science 2025-11-11 Haoran Xue, Gias Uddin, Song Wang

CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning

Large Language Models (LLMs) have recently demonstrated strong capabilities in code-related tasks, but their robustness in code reasoning under perturbations remains underexplored. We introduce CodeCrash, a stress-testing framework with…

Artificial Intelligence · Computer Science 2025-10-14 Man Ho Lam, Chaozheng Wang, Jen-tse Huang, Michael R. Lyu

CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks

Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code…

Software Engineering · Computer Science 2026-01-21 Danning Xie, Mingwei Zheng, Xuwei Liu, Jiannan Wang, Chengpeng Wang, Lin Tan, Xiangyu Zhang

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

Unlocking Reasoning Potential in Large Language Models by Scaling Code-form Planning

Despite the remarkable success of large language models (LLMs) on traditional natural language processing tasks, their planning ability remains a critical bottleneck in tackling complex multi-step reasoning tasks. Existing approaches mainly…

Computation and Language · Computer Science 2024-10-07 Jiaxin Wen, Jian Guan, Hongning Wang, Wei Wu, Minlie Huang

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or…

Software Engineering · Computer Science 2026-02-05 Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le

CodeGraph: Enhancing Graph Reasoning of LLMs with Code

With the increasing popularity of large language models (LLMs), reasoning on basic graph algorithm problems is an essential intermediate step in assessing their abilities to process and infer complex graph reasoning tasks. Existing methods…

Computation and Language · Computer Science 2024-08-27 Qiaolong Cai, Zhaowei Wang, Shizhe Diao, James Kwok, Yangqiu Song

Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code

Large language models (LLMs) are being increasingly adopted in the software engineering domain, yet the robustness of their grasp on core software design concepts remains unclear. We conduct an empirical study to systematically evaluate…

Software Engineering · Computer Science 2025-12-30 Mootez Saad, Boqi Chen, José Antonio Hernández López, Dániel Varró, Tushar Sharma

Code Simulation Challenges for Large Language Models

Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition to solve them correctly. This work studies to what extent Large Language Models (LLMs) can…

Machine Learning · Computer Science 2024-06-13 Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge

Cross-Task Benchmarking and Evaluation of General-Purpose and Code-Specific Large Language Models

Large Language Models (LLMs) have revolutionized both general natural language processing and domain-specific applications such as code synthesis, legal reasoning, and finance. However, while prior studies have explored individual model…

Software Engineering · Computer Science 2025-12-05 Gunjan Das, Paheli Bhattacharya, Rishabh Gupta

ProBench: Benchmarking Large Language Models in Competitive Programming

With reasoning language models such as OpenAI-o3 and DeepSeek-R1 emerging, large language models (LLMs) have entered a new phase of development. However, existing benchmarks for coding evaluation are gradually inadequate to assess the…

Computation and Language · Computer Science 2025-03-03 Lei Yang, Renren Jin, Ling Shi, Jianxiang Peng, Yue Chen, Deyi Xiong

Understanding Codebase like a Professional! Human-AI Collaboration for Code Comprehension

Understanding an unfamiliar codebase is an essential task for developers in various scenarios, such as during the onboarding process. Especially when the codebase is large and time is limited, achieving a decent level of comprehension…

Human-Computer Interaction · Computer Science 2026-02-16 Jie Gao, Yue Xue, Xiaofei Xie, SoeMin Thant, Erika Lee, Bowen Xu