Related papers: Deep Bayesian Active Learning for Multiple Correct…
We present an empirical study of active learning for Visual Question Answering, where a deep VQA model selects informative question-image pairs from a pool and queries an oracle for answers to maximally improve its performance under a…
The Visual Question Answering (VQA) task combines challenges for processing data with both Visual and Linguistic processing, to answer basic `common sense' questions about given images. Given an image and a question in natural language, the…
The predominant approach to Visual Question Answering (VQA) demands that the model represents within its weights all of the information required to answer any question about any image. Learning this information from any real training set…
We propose a novel probabilistic model for visual question answering (Visual QA). The key idea is to infer two sets of embeddings: one for the image and the question jointly and the other for the answers. The learning objective is to learn…
3D Visual Question Answering (3D VQA) is crucial for enabling models to perceive the physical world and perform spatial reasoning. In 3D VQA, the free-form nature of answers often leads to improper annotations that can confuse or mislead…
The field of visual question answering (VQA) has recently seen a surge in research focused on providing explanations for predicted answers. However, current systems mostly rely on separate models to predict answers and generate…
Visual question answering as recently proposed multimodal learning task has enjoyed wide attention from the deep learning community. Lately, the focus was on developing new representation fusion methods and attention mechanisms to achieve…
In visual question answering (VQA), an algorithm must answer text-based questions about images. While multiple datasets for VQA have been created since late 2014, they all have flaws in both their content and the way algorithms are…
Visual question answering (VQA) refers to the problem where, given an image and a natural language question about the image, a correct natural language answer has to be generated. A VQA model has to demonstrate both the visual understanding…
Despite remarkable progress in recent years, Vision Language Models (VLMs) remain prone to overconfidence and hallucinations on tasks such as Visual Question Answering (VQA) and Visual Reasoning. Bayesian methods can potentially improve…
Visual question answering (or VQA) is a new and exciting problem that combines natural language processing and computer vision techniques. We present a survey of the various datasets and models that have been used to tackle this task. The…
Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms…
This paper focuses on answering fill-in-the-blank style multiple choice questions from the Visual Madlibs dataset. Previous approaches to Visual Question Answering (VQA) have mainly used generic image features from networks trained on the…
One of the most intriguing features of the Visual Question Answering (VQA) challenge is the unpredictability of the questions. Extracting the information required to answer them demands a variety of image operations from detection and…
Visual question answering (VQA) is a task that combines both the techniques of computer vision and natural language processing. It requires models to answer a text-based question according to the information contained in a visual. In recent…
Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concept; such as the relationships between various objects. The limited use of object categories…
Recently, a number of deep-learning based models have been proposed for the task of Visual Question Answering (VQA). The performance of most models is clustered around 60-70%. In this paper we propose systematic methods to analyze the…
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer vision and the natural language processing communities. Given an image and a question in natural language, it requires…
Machine learning has advanced dramatically, narrowing the accuracy gap to humans in multimodal tasks like visual question answering (VQA). However, while humans can say "I don't know" when they are uncertain (i.e., abstain from answering a…
Visual Question Answering (VQA) has attracted much attention since it offers insight into the relationships between the multi-modal analysis of images and natural language. Most of the current algorithms are incapable of answering…