Related papers: ViperGPT: Visual Inference via Python Execution fo…

Analyzing Modular Approaches for Visual Question Decomposition

Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks. The latest such methods simultaneously introduce LLM-based code generation to build…

Computer Vision and Pattern Recognition · Computer Science 2023-11-14 Apoorv Khandelwal, Ellie Pavlick, Chen Sun

Modular Visual Question Answering via Code Generation

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models…

Computation and Language · Computer Science 2023-06-09 Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein

Inferring and Executing Programs for Visual Reasoning

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases…

Computer Vision and Pattern Recognition · Computer Science 2017-05-11 Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

Visual Programming: Compositional visual reasoning without training

We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of…

Computer Vision and Pattern Recognition · Computer Science 2022-11-22 Tanmay Gupta, Aniruddha Kembhavi

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

With the emergence of large language models (LLMs) and vision foundation models, how to combine the intelligence and capacity of these open-sourced or API-available models to achieve open-world visual perception remains an open question. In…

Computer Vision and Pattern Recognition · Computer Science 2024-03-15 Chris Kelly, Luhui Hu, Bang Yang, Yu Tian, Deshun Yang, Cindy Yang, Zaoshan Huang, Zihao Li, Jiayin Hu, Yuexian Zou

VIPER: Visual Perception and Explainable Reasoning for Sequential Decision-Making

While Large Language Models (LLMs) excel at reasoning on text and Vision-Language Models (VLMs) are highly effective for visual perception, applying those models for visual instruction-based planning remains a widely open problem. In this…

Machine Learning · Computer Science 2025-09-11 Mohamed Salim Aissi, Clemence Grislain, Mohamed Chetouani, Olivier Sigaud, Laure Soulier, Nicolas Thome

Synthesizing Visual Concepts as Vision-Language Programs

Vision-Language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning tasks, leading to inconsistent or illogical outputs. Neuro-symbolic methods promise to address this by inducing…

Artificial Intelligence · Computer Science 2025-11-25 Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

Recent research on Large Language Models (LLMs) has led to remarkable advancements in general NLP AI assistants. Some studies have further explored the use of LLMs for planning and invoking models or APIs to address more general multi-modal…

Computer Vision and Pattern Recognition · Computer Science 2023-06-29 Difei Gao, Lei Ji, Luowei Zhou, Kevin Qinghong Lin, Joya Chen, Zihan Fan, Mike Zheng Shou

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Large language models have shown their remarkable capabilities as a general interface for various language-related applications. Motivated by this, we target to build a unified interface for completing many vision-language tasks including…

Computer Vision and Pattern Recognition · Computer Science 2023-11-08 Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

VisualCoder: Guiding Large Language Models in Code Execution with Fine-grained Multimodal Chain-of-Thought Reasoning

Predicting program behavior and reasoning about code execution remain significant challenges in software engineering, particularly for large language models (LLMs) designed for code analysis. While these models excel at understanding static…

Software Engineering · Computer Science 2025-02-11 Cuong Chi Le, Hoang-Chau Truong-Vinh, Huy Nhat Phan, Dung Duy Le, Tien N. Nguyen, Nghi D. Q. Bui

Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions

Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case…

Computer Vision and Pattern Recognition · Computer Science 2021-06-18 Radhika Dua, Sai Srinivas Kancheti, Vineeth N Balasubramanian

Recursive Visual Programming

Visual Programming (VP) has emerged as a powerful framework for Visual Question Answering (VQA). By generating and executing bespoke code for each question, these methods demonstrate impressive compositional and reasoning capabilities,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-11 Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, Trevor Darrell

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Visual Product Graph: Bridging Visual Products And Composite Images For End-to-End Style Recommendations

Retrieving semantically similar but visually distinct contents has been a critical capability in visual search systems. In this work, we aim to tackle this problem with Visual Product Graph (VPG), leveraging high-performance infrastructure…

Computer Vision and Pattern Recognition · Computer Science 2025-05-28 Yue Li Du, Ben Alexander, Mikhail Antonenka, Rohan Mahadev, Hao-yu Wu, Dmitry Kislyuk

PandaGPT: One Model To Instruction-Follow Them All

We present PandaGPT, an approach to emPower large lANguage moDels with visual and Auditory instruction-following capabilities. Our pilot experiments show that PandaGPT can perform complex tasks such as detailed image description generation,…

Computation and Language · Computer Science 2023-05-29 Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, Deng Cai

VisorGPT: Learning Visual Prior via Generative Pre-Training

Various stuff and things in visual data possess specific traits, which can be learned by deep neural networks and are implicitly represented as the visual prior, e.g., object location and shape, in the model. Such prior potentially impacts…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou

SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Ming Dai, Lingfeng Yang, Yihao Xu, Zhenhua Feng, Wankou Yang

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems.…

Computation and Language · Computer Science 2024-10-07 Xueqing Wu, Zongyu Lin, Songyan Zhao, Te-Lin Wu, Pan Lu, Nanyun Peng, Kai-Wei Chang

ViUniT: Visual Unit Tests for More Robust Visual Programming

Programming based approaches to reasoning tasks have substantially expanded the types of questions models can answer about visual scenes. Yet on benchmark visual reasoning data, when models answer correctly, they produce incorrect programs…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Artemis Panagopoulou, Honglu Zhou, Silvio Savarese, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles

Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought

Despite exciting recent results showing vision-language systems' capacity to reason about images using natural language, their capacity for video reasoning remains under-explored. We motivate framing video reasoning as the sequential…

Computation and Language · Computer Science 2023-11-10 Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei, Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang