Related papers: Multiple-Question Multiple-Answer Text-VQA
Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label on a warning sign warns people about the danger in the…
Visual Question Answering (VQA) presents a unique challenge as it requires the ability to understand and encode the multi-modal inputs - in terms of image processing and natural language processing. The algorithm further needs to learn how…
The increasing availability of multimodal data across text, tables, and images presents new challenges for developing models capable of complex cross-modal reasoning. Existing methods for Multimodal Multi-hop Question Answering (MMQA) often…
Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. Prior works directly evaluate the answering models by simply calculating the accuracy of predicted answers. However,…
A trending paradigm for multiple-choice question answering (MCQA) is using a text-to-text framework. By unifying data in different tasks into a single text-to-text format, it trains a generative encoder-decoder model which is both powerful…
Visual Question Answering (VQA) becomes one of the most active research problems in the medical imaging domain. A well-known VQA challenge is the intrinsic diversity between the image and text modalities, and in the medical VQA task, there…
Medical Visual Question Answering (VQA) is a multi-modal challenging task widely considered by research communities of the computer vision and natural language processing. Since most current medical VQA models focus on visual content,…
Document Visual Question Answering (DocVQA) refers to the task of answering questions from document images. Existing work on DocVQA only considers single-page documents. However, in real scenarios documents are mostly composed of multiple…
This paper presents an in-depth study of multimodal machine translation (MMT), examining the prevailing understanding that MMT systems exhibit decreased sensitivity to visual information when text inputs are complete. Instead, we attribute…
We present MCQA, a learning-based algorithm for multimodal question answering. MCQA explicitly fuses and aligns the multimodal input (i.e. text, audio, and video), which forms the context for the query (question and answer). Our approach…
When answering complex questions, people can seamlessly combine information from visual, textual and tabular sources. While interest in models that reason over multiple pieces of evidence has surged in recent years, there has been…
Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose…
Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a…
Textbook Question Answering (TQA) is a complex multimodal task to infer answers given large context descriptions and abundant diagrams. Compared with Visual Question Answering (VQA), TQA contains a large number of uncommon terminologies and…
Visual question answering (VQA) in medical imaging aims to support clinical diagnosis by automatically interpreting complex imaging data in response to natural language queries. Existing studies typically rely on distinct visual and textual…
Multi-hop Question Answering (MHQA) adds layers of complexity to question answering, making it more challenging. When Language Models (LMs) are prompted with multiple search results, they are tasked not only with retrieving relevant…
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly…
Visual question answering (VQA) demands simultaneous comprehension of both the image visual content and natural language questions. In some cases, the reasoning needs the help of common sense or general knowledge which usually appear in the…
The ideal form of Visual Question Answering requires understanding, grounding and reasoning in the joint space of vision and language and serves as a proxy for the AI task of scene understanding. However, most existing VQA benchmarks are…
Document Visual Question Answering (DocVQA) requires models to jointly understand textual semantics, spatial layout, and visual features. Current methods struggle with explicit spatial relationship modeling, inefficiency with…