Related papers: RECODE: Reasoning Through Code Generation for Visu…

ReLook: Vision-Grounded RL with a Multimodal LLM Critic for Agentic Web Coding

While Large Language Models (LLMs) excel at algorithmic code generation, they struggle with front-end development, where correctness is judged on rendered pixels and interaction. We present ReLook, an agentic, vision-grounded reinforcement…

Machine Learning · Computer Science · 2025-10-14 · Yuhang Li, Chenchen Zhang, Ruilin Lv, Ao Liu, Ken Deng, Yuanxing Zhang, Jiaheng Liu, Wiggin Zhou, Bo Zhou

MMCode: Benchmarking Multimodal Large Language Models for Code Generation with Visually Rich Programming Problems

Programming often involves converting detailed and complex specifications into code, a process during which developers typically utilize visual aids to more effectively convey concepts. While recent developments in Large Multimodal Models…

Computation and Language · Computer Science · 2024-09-27 · Kaixin Li, Yuchen Tian, Qisheng Hu, Ziyang Luo, Zhiyong Huang, Jing Ma

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Code has emerged as a precise and executable medium for reasoning and action in the agent era. Yet, progress has largely focused on language-centric tasks such as program synthesis and debugging, leaving visual-centric coding underexplored.…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-05 · Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang

VCoder: Versatile Vision Encoders for Multimodal Large Language Models

Humans possess the remarkable skill of Visual Perception, the ability to see and understand the seen, helping them make sense of the visual world and, in turn, reason. Multimodal Large Language Models (MLLM) have recently achieved…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-25 · Jitesh Jain, Jianwei Yang, Humphrey Shi

Inferring and Executing Programs for Visual Reasoning

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases…

Computer Vision and Pattern Recognition · Computer Science · 2017-05-11 · Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

CodePercept: Code-Grounded Visual STEM Perception for MLLMs

When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-12 · Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang

ReMind: Understanding Deductive Code Reasoning in LLMs

Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite their advancement, empirical evidence reveals that they still struggle with deductive code reasoning, the ability to reason about the…

Programming Languages · Computer Science · 2025-11-04 · Jun Gao, Yun Peng, Xiaoxue Ren

Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

Large Vision-Language Models (LVLMs) can reason from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-25 · Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose…

Artificial Intelligence · Computer Science · 2026-05-08 · Zehao Deng, Tianjie Ju, Zheng Wu, Liangbo He, Jun Lan, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang

Illuminating LLM Coding Agents: Visual Analytics for Deeper Understanding and Enhancement

Coding agents powered by large language models (LLMs) have gained traction for automating code generation through iterative problem-solving with minimal human involvement. Despite the emergence of various frameworks, e.g., LangChain,…

Machine Learning · Computer Science · 2025-08-19 · Junpeng Wang, Yuzhong Chen, Menghai Pan, Chin-Chia Michael Yeh, Mahashweta Das

VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning

Predicting program behavior and reasoning about code execution remain significant challenges in software engineering, particularly for large language models (LLMs) designed for code analysis. While these models excel at understanding static…

Software Engineering · Computer Science · 2025-02-11 · Cuong Chi Le, Hoang-Chau Truong-Vinh, Huy Nhat Phan, Dung Duy Le, Tien N. Nguyen, Nghi D. Q. Bui

Latent Visual Reasoning

Multimodal Large Language Models (MLLMs) have achieved notable gains in various tasks by incorporating Chain-of-Thought (CoT) reasoning in language spaces. Recent work extends this direction by leveraging external tools for visual editing,…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-07 · Bangzheng Li, Ximeng Sun, Jiang Liu, Ze Wang, Jialian Wu, Xiaodong Yu, Hao Chen, Emad Barsoum, Muhao Chen, Zicheng Liu

V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-26 · Dongyang Chen, Chaoyang Wang, Dezhao Su, Xi Xiao, Zeyu Zhang, Jing Xiong, Qing Li, Yuzhang Shang, Shichao Kan

Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models

In this work, we address the task of table image to LaTeX code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately…

Artificial Intelligence · Computer Science · 2025-09-23 · Jun Ling, Yao Qi, Tao Huang, Shibo Zhou, Yanqin Huang, Jiang Yang, Ziqi Song, Ying Zhou, Yang Yang, Heng Tao Shen, Peng Wang

De-rendering, Reasoning, and Repairing Charts with Vision-Language Models

Data visualizations are central to scientific communication, journalism, and everyday decision-making, yet they are frequently prone to errors that can distort interpretation or mislead audiences. Rule-based visualization linters can flag…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-25 · Valentin Bonas, Martin Sinnona, Viviana Siless, Emmanuel Iarussi

Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback

Multimodal Large Language Models (MLLMs) exhibit impressive performance across various visual tasks. Subsequent investigations into enhancing their visual reasoning abilities have significantly expanded their performance envelope. However,…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-08 · Yang Chen, Yufan Shen, Wenxuan Huang, Sheng Zhou, Qunshu Lin, Xinyu Cai, Zhi Yu, Jiajun Bu, Botian Shi, Yu Qiao

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-21 · Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-04 · Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin

VisCoder2: Building Multi-Language Visualization Coding Agents

Large language models (LLMs) have recently enabled coding agents capable of generating, executing, and revising visualization code. However, existing models often fail in practical workflows due to limited language coverage, unreliable…

Software Engineering · Computer Science · 2026-04-09 · Yuansheng Ni, Songcheng Cai, Xiangchao Chen, Jiarong Liang, Zhiheng Lyu, Jiaqi Deng, Kai Zou, Ping Nie, Fei Yuan, Xiang Yue, Wenhu Chen

Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

Image-to-code generation tests whether a vision-language model (VLM) can recover the structure of an image enough to express it as executable code. Existing benchmarks either focus on narrow visual domains, depend on paired executable…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-13 · Ajay Vikram Periasami, Junlin Wang, Bhuwan Dhingra