Related papers: CodePercept: Code-Grounded Visual STEM Perception …

200 papers

Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based…

Computation and Language · Computer Science 2026-04-29 Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu

Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and…

Understanding and reasoning about code semantics is essential for enhancing code LLMs' abilities to solve real-world software engineering (SE) tasks. Although several code reasoning benchmarks exist, most rely on synthetic datasets or…

Software Engineering · Computer Science 2026-02-05 Monoshi Kumar Roy, Simin Chen, Benjamin Steenhoek, Jinjun Peng, Gail Kaiser, Baishakhi Ray, Wei Le

Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering --…

Computer Vision and Pattern Recognition · Computer Science 2026-03-11 Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

Multimodal Large Language Models (MLLMs) show reasoning promise, yet their visual perception is a critical bottleneck. Strikingly, MLLMs can produce correct answers even while misinterpreting crucial visual elements, masking these…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Aditya Kanade, Tanuja Ganu

Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence. We present a systematic study of perception strategies (explicit, implicit, visual, and textual) across four multimodal benchmarks and two…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Yizhuo Ding, Mingkang Chen, Zhibang Feng, Tong Xiao, Wanying Qu, Wenqi Shao, Yanwei Fu

Recent advances in Large Language Models (LLMs) and Vision Language Models (VLMs) have shown significant progress in mathematical reasoning, yet they still face a critical bottleneck with problems requiring visual assistance, such as…

Computer Vision and Pattern Recognition · Computer Science 2025-10-14 Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, Hongsheng Li, Yi Ma, Xihui Liu

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly…

Computer Vision and Pattern Recognition · Computer Science 2026-05-11 Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang, Tianrun Yuan, Juntong Chen, Yongkang Zhu, Fanhu Zeng, Xuanyu Zhu, Tao Feng, Yige Xu

This paper introduces Code-Vision, a benchmark designed to evaluate the logical understanding and code generation capabilities of Multimodal Large Language Models (MLLMs). It challenges MLLMs to generate a correct program that fulfills…

Computation and Language · Computer Science 2025-02-18 Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu

Recent advances in Code Large Language Models (CodeLLMs) have primarily focused on open-ended code generation, often overlooking the crucial aspect of code understanding and reasoning. To bridge this gap, we introduce CodeMMLU, a…

Software Engineering · Computer Science 2025-04-10 Dung Nguyen Manh, Thang Phan Chau, Nam Le Hai, Thong T. Doan, Nam V. Nguyen, Quang Pham, Nghi D. Q. Bui

Multimodal Large Language Models (MLLMs) have achieved remarkable success in open-vocabulary perceptual tasks, yet their ability to solve complex cognitive problems remains limited, especially when visual details are abstract and require…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Boyi Li, Yifan Shen, Yuanzhe Liu, Yifan Xu, Jiateng Liu, Xinzhuo Li, Zhengyuan Li, Jingyuan Zhu, Yunhan Zhong, Fangzhou Lan, Jianguo Cao, James M. Rehg, Heng Ji, Ismini Lourentzou, Xu Cao

Recent progress in Multi-modal Large Language Models (MLLMs) has enabled step-by-step multi-modal mathematical reasoning by performing visual operations based on the textual instructions. A promising approach uses code as an intermediate…

Computation and Language · Computer Science 2025-11-06 Xiaoyuan Li, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

Vision-Language Models (VLMs) are increasingly deployed in autonomous driving and embodied AI systems, where reliable perception is critical for safe semantic reasoning and decision-making. While recent VLMs demonstrate strong performance…

Computer Vision and Pattern Recognition · Computer Science 2026-01-16 Guo Cheng

Vision-Language Models (VLMs) have shown remarkable progress in visual understanding in recent years. Yet, they still lag behind human capabilities in specific visual tasks such as counting or relational reasoning. To understand the…

Computer Vision and Pattern Recognition · Computer Science 2025-11-14 Zihan Weng, Lucas Gomez, Taylor Whittington Webb, Pouya Bashivan

Enhancing the multimodal reasoning capabilities of Multimodal Large Language Models (MLLMs) is a challenging task that has attracted increasing attention in the community. Recently, several studies have applied Reinforcement Learning with…

Machine Learning · Computer Science 2026-03-04 Tong Xiao, Xin Xu, Zhenya Huang, Hongyu Gao, Quan Liu, Qi Liu, Enhong Chen

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

Large Vision-Language Models (LVLMs) typically align visual features from an encoder with a pre-trained Large Language Model (LLM). However, this makes the visual perception module a bottleneck, which constrains the overall capabilities of…

Artificial Intelligence · Computer Science 2025-11-18 Wenhao Zhou, Hao Zheng, Rong Zhao

Advances in large language models (LLMs) have spurred research into enhancing their reasoning capabilities, particularly in math-rich STEM (Science, Technology, Engineering, and Mathematics) documents. While LLMs can generate equations or…

Computation and Language · Computer Science 2025-06-03 Jiaru Zou, Qing Wang, Pratyush Thakur, Nickvash Kani

Diagrams represent a form of visual language that encodes abstract concepts and relationships through structured symbols and their spatial arrangements. Unlike natural images, they are inherently symbolic, and entirely artificial. They thus…

Computer Vision and Pattern Recognition · Computer Science 2025-12-09 Yanpeng Sun, Shan Zhang, Wei Tang, Aotian Chen, Piotr Koniusz, Kai Zou, Yuan Xue, Anton van den Hengel

Multimodal large language models (MLLMs) have achieved remarkable progress in visual understanding tasks such as visual grounding, segmentation, and captioning. However, their ability to perceive perceptual-level image features remains…

Computer Vision and Pattern Recognition · Computer Science 2025-12-29 Shuo Cao, Jiayang Li, Xiaohui Li, Yuandong Pu, Kaiwen Zhu, Yuanting Gao, Siqi Luo, Yi Xin, Qi Qin, Yu Zhou, Xiangyu Chen, Wenlong Zhang, Bin Fu, Yu Qiao, Yihao Liu