Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
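The generate, select, and refine loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `derender_select_refine` and its callable parameters (`propose`, `render`, `score`, `refine`) are hypothetical names introduced here, and the stopping rule (stop when a refinement no longer improves the critic score) is an assumption. The toy stand-ins at the bottom treat an "image" as a list of integers and a "program" as a comma-separated string so that the sketch runs without any model or renderer.

```python
from typing import Callable, List, Tuple


def derender_select_refine(
    image: object,
    propose: Callable[[object], List[str]],      # hypothetical: image -> candidate programs
    render: Callable[[str], object],             # hypothetical: program -> rendered output
    score: Callable[[object, object], float],    # hypothetical critic: higher = more faithful
    refine: Callable[[object, str, object], str],  # hypothetical: revise a program
    rounds: int = 2,
) -> Tuple[str, float]:
    """Return the most faithful program found and its critic score."""
    candidates = propose(image)
    if not candidates:
        raise ValueError("propose() returned no candidate programs")

    # Selection: render every candidate and keep the one the critic rates highest.
    scored = [(score(image, render(c)), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])

    # Refinement: accept a revision only if it strictly improves the score
    # (an assumed stopping rule, chosen for simplicity).
    for _ in range(rounds):
        revised = refine(image, best, render(best))
        revised_score = score(image, render(revised))
        if revised_score <= best_score:
            break
        best, best_score = revised, revised_score
    return best, best_score


# --- Toy stand-ins (assumptions, for illustration only) ---
def toy_render(program: str) -> List[int]:
    return [int(x) for x in program.split(",")]


def toy_score(target: List[int], rendered: List[int]) -> float:
    # Negative L1 distance: 0.0 means a perfect reconstruction.
    return float(-sum(abs(a - b) for a, b in zip(target, rendered)))


def toy_propose(image: List[int]) -> List[str]:
    return ["1,1,1", "3,5,1", "0,0,0"]


def toy_refine(image: List[int], program: str, rendered: List[int]) -> str:
    # Nudge the first mismatching value one step toward the target.
    values = list(rendered)
    for i, (t, v) in enumerate(zip(image, values)):
        if t != v:
            values[i] = v + (1 if t > v else -1)
            break
    return ",".join(str(v) for v in values)


if __name__ == "__main__":
    program, fidelity = derender_select_refine(
        [3, 5, 4], toy_propose, toy_render, toy_score, toy_refine, rounds=5
    )
    print(program, fidelity)
```

In a real system the callables would presumably wrap an MLLM and a plotting backend; once a faithful program is recovered, downstream questions can be answered by inspecting or executing that program rather than by reading pixels.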
@article{arxiv.2510.13756,
title = {RECODE: Reasoning Through Code Generation for Visual Question Answering},
author = {Junhong Shen and Mu Cai and Bo Hu and Ameet Talwalkar and David A Ross and Cordelia Schmid and Alireza Fathi},
journal = {arXiv preprint arXiv:2510.13756},
year = {2025}
}
Comments
The authors are withdrawing this manuscript temporarily to conduct additional checks of the experimental setup and implementation. We plan to post an updated version after completing these checks.