Visual Question Answering as a Multi-Task Problem

Computer Vision and Pattern Recognition 2020-07-06 v1 Artificial Intelligence Computation and Language

Abstract

Visual Question Answering (VQA) is a highly complex problem set, relying on many sub-problems to produce reasonable answers. In this paper, we present the hypothesis that Visual Question Answering should be viewed as a multi-task problem, and provide evidence to support this hypothesis. We demonstrate this by reformatting two commonly used Visual Question Answering datasets, COCO-QA and DAQUAR, into a multi-task format and training two baseline networks on the reformatted datasets, one of which is designed specifically to eliminate other possible causes for performance changes as a result of the reformatting. Though the networks demonstrated in this paper do not achieve strongly competitive results, we find that the multi-task approach to Visual Question Answering results in increases in performance of 5-9% against the single-task formatting, and that the networks reach convergence much faster than in the single-task case. Finally, we discuss possible reasons for the observed difference in performance, and perform additional experiments which rule out causes not associated with the learning of the dataset as a multi-task problem.

Cite

@article{arxiv.2007.01780,
  title  = {Visual Question Answering as a Multi-Task Problem},
  author = {Amelia Elizabeth Pollard and Jonathan L. Shapiro},
  journal= {arXiv preprint arXiv:2007.01780},
  year   = {2020}
}