Related papers: Transductive Visual Programming: Evolving Tool Lib…

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models

Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities, their performance is still inferior to specialized models on downstream tasks, which makes adaptation necessary to enhance their utility.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-18 Yichi Zhang, Yinpeng Dong, Siyuan Zhang, Tianzan Min, Hang Su, Jun Zhu

Text-Visual Prompting for Efficient 2D Temporal Video Grounding

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual…

Computer Vision and Pattern Recognition · Computer Science 2023-10-05 Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by…

Computer Vision and Pattern Recognition · Computer Science 2024-04-08 Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

Recursive Visual Programming

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-11 Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

Beyond Embeddings: The Promise of Visual Table in Visual Reasoning

Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often…

Computer Vision and Pattern Recognition · Computer Science 2024-06-18 Yiwu Zhong, Zi-Yuan Hu, Michael R. Lyu, Liwei Wang

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Large Language Model (LLM)-based agents have shown promise in procedural tasks, but the potential of multimodal instructions augmented by texts and videos to assist users remains under-explored. To address this gap, we propose the Visually…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Muhammet Furkan Ilaslan, Ali Koksal, Kevin Qinhong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu

Spatial Understanding from Videos: Structured Prompts Meet Simulation Data

Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial…

Computer Vision and Pattern Recognition · Computer Science 2025-09-22 Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie

Image Translation as Diffusion Visual Programmers

We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence…

Computer Vision and Pattern Recognition · Computer Science 2024-02-01 Cheng Han, James C. Liang, Qifan Wang, Majid Rabbani, Sohail Dianat, Raghuveer Rao, Ying Nian Wu, Dongfang Liu

Reinforced Visual Perception with Tools

Visual reasoning, a cornerstone of human intelligence, encompasses complex perceptual and logical processes essential for solving diverse visual problems. While advances in computer vision have produced powerful models for various…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Zetong Zhou, Dongping Chen, Zixian Ma, Zhihan Hu, Mingyang Fu, Sinan Wang, Yao Wan, Zhou Zhao, Ranjay Krishna

Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing…

Artificial Intelligence · Computer Science 2025-11-25 Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting

CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization

Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant…

Computer Vision and Pattern Recognition · Computer Science 2026-03-03 Xinhai Hou, Shaoyuan Xu, Manan Biyani, Moyan Li, Jia Liu, Todd C. Hollon, Bryan Wang

Enhancing Visual Programming for Visual Reasoning via Probabilistic Graphs

Recently, Visual Programming (VP) based on large language models (LLMs) has rapidly developed and demonstrated significant potential in complex Visual Reasoning (VR) tasks. Previous works to enhance VP have primarily focused on improving…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Wentao Wan, Kaiyu Wu, Qingyang Ma, Nan Kang, Yunjie Chen, Liang Lin, Keze Wang

Enhancing Spatial Reasoning through Visual and Textual Thinking

The spatial reasoning task aims to reason about the spatial relationships in 2D and 3D space, which is a fundamental capability for Visual Question Answering (VQA) and robotics. Although vision language models (VLMs) have developed rapidly…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Xun Liang, Xin Guo, Zhongming Jin, Weihang Pan, Penghui Shang, Deng Cai, Binbin Lin, Jieping Ye

Toward Causal-Visual Programming: Enhancing Agentic Reasoning in Low-Code Environments

Large language model (LLM) agents are increasingly capable of orchestrating complex tasks in low-code environments. However, these agents often exhibit hallucinations and logical inconsistencies because their inherent reasoning mechanisms…

Artificial Intelligence · Computer Science 2025-10-09 Jiexi Xu, Jiaqi Liu, Lanruo Wang, Su Liu

Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting

Pre-trained language models (PLMs) have played an increasing role in multimedia research. In terms of vision-language (VL) tasks, they often serve as a language encoder and still require an additional fusion network for VL reasoning,…

Computer Vision and Pattern Recognition · Computer Science 2023-08-23 Shubin Huang, Qiong Wu, Yiyi Zhou, Weijie Chen, Rongsheng Zhang, Xiaoshuai Sun, Rongrong Ji

AutoVP: An Automated Visual Prompting Framework and Benchmark

Visual prompting (VP) is an emerging parameter-efficient fine-tuning approach to adapting pre-trained vision models to solve various downstream image-classification tasks. However, there has hitherto been little systematic study of the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Hsi-Ai Tsao, Lei Hsiung, Pin-Yu Chen, Sijia Liu, Tsung-Yi Ho

Programmable-Room: Interactive Textured 3D Room Meshes Generation Empowered by Large Language Models

We present Programmable-Room, a framework which interactively generates and edits a 3D room mesh, given natural language instructions. For precise control of each attribute of a room, we decompose the challenging task into simpler steps such…

Computer Vision and Pattern Recognition · Computer Science 2025-06-24 Jihyun Kim, Junho Park, Kyeongbo Kong, Suk-Ju Kang

Visual Agentic AI for Spatial Reasoning with a Dynamic API

Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images.…

Computer Vision and Pattern Recognition · Computer Science 2025-03-31 Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari

CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning

Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao

Thinking with Programming Vision: Towards a Unified View for Thinking with Images

Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Zirun Guo, Minjie Hong, Feng Zhang, Kai Jia, Tao Jin