Related papers: Visual Program Distillation with Template-Based Au…

Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets

Visual Question Answering (VQA) is the task of answering a question about an image and requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning…

Artificial Intelligence · Computer Science 2024-10-15 Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch

Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by…

Computer Vision and Pattern Recognition · Computer Science 2024-04-08 Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

Sub-goal Distillation: A Method to Improve Small Language Agents

While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in…

Machine Learning · Computer Science 2024-05-07 Maryam Hashemzadeh, Elias Stengel-Eskin, Sarath Chandar, Marc-Alexandre Cote

Synthesize Step-by-Step: Tools, Templates and LLMs as Data Generators for Reasoning-Based Chart VQA

Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar

Towards Developing a Multilingual and Code-Mixed Visual Question Answering System by Knowledge Distillation

Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task. However, most pre-trained models are trained by only considering monolingual learning, especially the resource-rich language…

Computation and Language · Computer Science 2021-09-13 Humair Raj Khan, Deepak Gupta, Asif Ekbal

Effective Distillation of Table-based Reasoning Ability from LLMs

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their enormous parameter size and extremely high requirements for compute power pose challenges for…

Computation and Language · Computer Science 2024-03-26 Bohao Yang, Chen Tang, Kun Zhao, Chenghao Xiao, Chenghua Lin

SDRT: Enhance Vision-Language Models by Self-Distillation with Diverse Reasoning Traces

Reasoning is increasingly crucial for various tasks. While chain-of-thought prompting enables large language models to leverage reasoning effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains…

Computer Vision and Pattern Recognition · Computer Science 2025-03-21 Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu

Meta-Adaptive Prompt Distillation for Few-Shot Visual Question Answering

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new visual question answering (VQA) tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, does not always improve…

Artificial Intelligence · Computer Science 2026-03-03 Akash Gupta, Amos Storkey, Mirella Lapata

Data Factory with Minimal Human Effort Using VLMs

Generating enough and diverse data through augmentation offers an efficient solution to the time-consuming and labour-intensive process of collecting and annotating pixel-wise images. Traditional data augmentation techniques often face…

Computer Vision and Pattern Recognition · Computer Science 2025-10-08 Jiaojiao Ye, Jiaxing Zhong, Qian Xie, Yuzhou Zhou, Niki Trigoni, Andrew Markham

Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models

Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models…

Computation and Language · Computer Science 2024-12-10 Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti

Variational Template Machine for Data-to-Text Generation

How to generate descriptions from structured data organized in tables? Existing approaches using neural encoder-decoder models often suffer from lacking diversity. We claim that an open set of templates is crucial for enriching the phrase…

Computation and Language · Computer Science 2020-02-14 Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, Lei Li

Rethinking Data Augmentation for Robust Visual Question Answering

Data Augmentation (DA) -- generating extra training samples beyond the original training set -- has been widely used in today's unbiased VQA models to mitigate the language biases. Current mainstream DA strategies are synthetic-based methods,…

Computer Vision and Pattern Recognition · Computer Science 2022-09-16 Long Chen, Yuhang Zheng, Jun Xiao

Tackling VQA with Pretrained Foundation Models without Further Training

Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated ability to adapt well to different tasks through zero-shot or few-shot settings. With the capability…

Computer Vision and Pattern Recognition · Computer Science 2023-09-28 Alvin De Jun Tan, Bingquan Shen

FashionVQA: A Domain-Specific Visual Question Answering System

Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural…

Computer Vision and Pattern Recognition · Computer Science 2022-08-25 Min Wang, Ata Mahjoubfar, Anupama Joshi

Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications

We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in…

Computation and Language · Computer Science 2024-10-11 Oren Sultan, Alex Khasin, Guy Shiran, Asnat Greenstein-Messica, Dafna Shahaf

Winning Big with Small Models: Knowledge Distillation vs. Self-Training for Reducing Hallucination in Product QA Agents

The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented…

Computation and Language · Computer Science 2025-07-22 Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang

Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation

Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical…

Computation and Language · Computer Science 2025-09-11 Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non…

Computer Vision and Pattern Recognition · Computer Science 2023-06-08 Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker

Modular Visual Question Answering via Code Generation

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models…

Computation and Language · Computer Science 2023-06-09 Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, Dan Klein