Related papers: CodePercept: Code-Grounded Visual STEM Perception …

CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based…

Computation and Language · Computer Science 2026-04-29 Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and…

Artificial Intelligence · Computer Science 2025-10-17 Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, Fei Luo, Xiaohua Chen, Xiaoshuai Hao, Hehan Li, Andi Zhang, Wenxuan Wang, Kaiyan Zhang, Guoli Jia, Lingling Li, Zhiwu Lu, Yang Lu, Yike Guo

CodeSense: a Real-World Benchmark and Dataset for Code Semantic Reasoning

Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or…

Software Engineering · Computer Science 2026-02-05 Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le

RECODE: Reasoning Through Code Generation for Visual Question Answering

Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering --…

Computer Vision and Pattern Recognition · Computer Science 2026-03-11 Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs

Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Aditya Kanade, Tanuja Ganu

VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding

Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies (explicit, implicit, visual, and textual) across four multimodal benchmarks and two…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Yizhuo Ding, Mingkang Chen, Zhibang Feng, Tong Xiao, Wanying Qu, Wenqi Shao, Yanwei Fu

CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images

Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu

Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly…

Computer Vision and Pattern Recognition · Computer Science 2026-05-11 Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang, Tianrun Yuan, Juntong Chen, Yongkang Zhu, Fanhu Zeng, Xuanyu Zhu, Tao Feng, Yige Xu

Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills…

Computation and Language · Computer Science 2025-02-18 Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu

CodeMMLU: A Multi-Task Benchmark for Assessing Code Understanding & Reasoning Capabilities of CodeLLMs

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science 2025-04-10 Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

Toward Cognitive Supersensing in Multimodal Large Language Model

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Boyi Li, Yifan Shen, Yuanzhe Liu, Yifan Xu, Jiateng Liu, Xinzhuo Li, Zhengyuan Li, Jingyuan Zhu, Yunhan Zhong, Fangzhou Lan, Jianguo Cao, James M. Rehg, Heng Ji, Ismini Lourentzou, Xu Cao

MathOPEval: A Fine-grained Evaluation Benchmark for Visual Operations of MLLMs in Mathematical Reasoning

Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate…

Computation and Language · Computer Science 2025-11-06 Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

Semantic Misalignment in Vision-Language Models under Perceptual Degradation

Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance…

Computer Vision and Pattern Recognition · Computer Science 2026-01-16 Guo Cheng

Caption This, Reason That: VLMs Caught in the Middle

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the…

Computer Vision and Pattern Recognition · Computer Science 2025-11-14 Zihan Weng, Lucas Gomez, Taylor Whittington Webb, Pouya Bashivan

Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward

Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with…

Machine Learning · Computer Science 2026-03-04 Tong Xiao, Xin Xu, Zhenya Huang, Hongyu Gao, Quan Liu, Qi Liu, Enhong Chen

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

TopoPerception: A Shortcut-Free Evaluation of Global Visual Perception in Large Vision-Language Models

Large Vision-Language Models (LVLMs) typically align visual features from an encoder with a pre-trained Large Language Model (LLM). However, this makes the visual perception module a bottleneck, which constrains the overall capabilities of…

Artificial Intelligence · Computer Science 2025-11-18 Wenhao Zhou, Hao Zheng, Rong Zhao

STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing

Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or…

Computation and Language · Computer Science 2025-06-03 Jiaru Zou, Qing Wang, Pratyush Thakur, Nickvash Kani

Math Blind: Failures in Diagram Understanding Undermine Reasoning in MLLMs

Diagrams represent a form of visual language that encodes abstract concepts and relationships through structured symbols and their spatial arrangements. Unlike natural images, they are inherently symbolic, and entirely artificial. They thus…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel

UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains…

Computer Vision and Pattern Recognition · Computer Science 2025-12-29 Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu