Related papers: Visual Program Distillation with Template-Based Au…

200 papers

Visual Question Answering (VQA) is the task of answering a question about an image and requires processing multimodal input and reasoning to obtain the answer. Modular solutions that use declarative representations within the reasoning…

Artificial Intelligence · Computer Science 2024-10-15 Thomas Eiter, Jan Hadl, Nelson Higuera, Johannes Oetsch

Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space, recognizing instruments, and also retrieving prior knowledge. Recent work shows promise by…

Computer Vision and Pattern Recognition · Computer Science 2024-04-08 Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, Ariel Fuxman

Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Zaid Khan, Vijay Kumar BG, Samuel Schulter, Yun Fu, Manmohan Chandraker

While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in…

Machine Learning · Computer Science 2024-05-07 Maryam Hashemzadeh, Elias Stengel-Eskin, Sarath Chandar, Marc-Alexandre Cote

Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions, current chart visual question answering (chart VQA) models suffer on complex…

Computer Vision and Pattern Recognition · Computer Science 2024-03-29 Zhuowan Li, Bhavan Jasani, Peng Tang, Shabnam Ghadar

Pre-trained language-vision models have shown remarkable performance on the visual question answering (VQA) task. However, most pre-trained models are trained by only considering monolingual learning, especially the resource-rich language…

Computation and Language · Computer Science 2021-09-13 Humair Raj Khan, Deepak Gupta, Asif Ekbal

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, their enormous parameter size and extremely high requirements for compute power pose challenges for…

Computation and Language · Computer Science 2024-03-26 Bohao Yang, Chen Tang, Kun Zhao, Chenghao Xiao, Chenghua Lin

Reasoning is increasingly crucial for various tasks. While chain-of-thought prompting enables large language models to leverage reasoning effectively, harnessing the reasoning capabilities of Vision-Language Models (VLMs) remains…

Computer Vision and Pattern Recognition · Computer Science 2025-03-21 Guande Wu, Huan Song, Yawei Wang, Qiaojing Yan, Yijun Tian, Lin Lee Cheong, Panpan Xu

Large Multimodal Models (LMMs) often rely on in-context learning (ICL) to perform new visual question answering (VQA) tasks with minimal supervision. However, ICL performance, especially in smaller LMMs, does not always improve…

Artificial Intelligence · Computer Science 2026-03-03 Akash Gupta, Amos Storkey, Mirella Lapata

Generating enough and diverse data through augmentation offers an efficient solution to the time-consuming and labour-intensive process of collecting and annotating pixel-wise images. Traditional data augmentation techniques often face…

Computer Vision and Pattern Recognition · Computer Science 2025-10-08 Jiaojiao Ye, Jiaxing Zhong, Qian Xie, Yuzhou Zhou, Niki Trigoni, Andrew Markham

Natural Language Explanation (NLE) aims to elucidate the decision-making process by providing detailed, human-friendly explanations in natural language. It helps demystify the decision-making processes of large vision-language models…

Computation and Language · Computer Science 2024-12-10 Patrick Amadeus Irawan, Genta Indra Winata, Samuel Cahyawijaya, Ayu Purwarianti

How to generate descriptions from structured data organized in tables? Existing approaches using neural encoder-decoder models often suffer from lacking diversity. We claim that an open set of templates is crucial for enriching the phrase…

Computation and Language · Computer Science 2020-02-14 Rong Ye, Wenxian Shi, Hao Zhou, Zhongyu Wei, Lei Li

Data Augmentation (DA) -- generating extra training samples beyond original training set -- has been widely-used in today's unbiased VQA models to mitigate the language biases. Current mainstream DA strategies are synthetic-based methods,…

Computer Vision and Pattern Recognition · Computer Science 2022-09-16 Long Chen, Yuhang Zheng, Jun Xiao

Large language models (LLMs) have achieved state-of-the-art results in many natural language processing tasks. They have also demonstrated ability to adapt well to different tasks through zero-shot or few-shot settings. With the capability…

Computer Vision and Pattern Recognition · Computer Science 2023-09-28 Alvin De Jun Tan, Bingquan Shen

Humans apprehend the world through various sensory modalities, yet language is their predominant communication channel. Machine learning systems need to draw on the same multimodal richness to have informed discourses with humans in natural…

Computer Vision and Pattern Recognition · Computer Science 2022-08-25 Min Wang, Ata Mahjoubfar, Anupama Joshi

We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks; specifically, we modify images and videos by interpreting user stylistic requests, specified in…

Computation and Language · Computer Science 2024-10-11 Oren Sultan, Alex Khasin, Guy Shiran, Asnat Greenstein-Messica, Dafna Shahaf

The deployment of Large Language Models (LLMs) in customer support is constrained by hallucination (generating false information) and the high cost of proprietary models. To address these challenges, we propose a retrieval-augmented…

Computation and Language · Computer Science 2025-07-22 Ashley Lewis, Michael White, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang

Long-form video processing fundamentally challenges vision-language models (VLMs) due to the high computational costs of handling extended temporal sequences. Existing token pruning and feature merging methods often sacrifice critical…

Computation and Language · Computer Science 2025-09-11 Chuanqi Cheng, Jian Guan, Wei Wu, Rui Yan

Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non…

Computer Vision and Pattern Recognition · Computer Science 2023-06-08 Zaid Khan, Vijay Kumar BG, Samuel Schulter, Xiang Yu, Yun Fu, Manmohan Chandraker

We present a framework that formulates visual question answering as modular code generation. In contrast to prior work on modular approaches to VQA, our approach requires no additional training and relies on pre-trained language models…
