Related papers: MSG-Chart: Multimodal Scene Graph for ChartQA

Graph-Based Multimodal Contrastive Learning for Chart Question Answering

Chart question answering (ChartQA) is challenged by the heterogeneous composition of chart elements and the subtle data patterns they encode. This work introduces a novel joint multimodal scene graph framework that explicitly models the…

Computation and Language · Computer Science 2025-04-08 Yue Dai, Soyeon Caren Han, Wei Liu

mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning

In the fields of computer vision and natural language processing, multimodal chart question-answering, especially involving color, structure, and textless charts, poses significant challenges. Traditional methods, which typically involve…

Computer Vision and Pattern Recognition · Computer Science 2024-04-03 Jingxuan Wei, Nan Xu, Guiyong Chang, Yin Luo, BiHui Yu, Ruifeng Guo

Understanding the Role of Scene Graphs in Visual Question Answering

Visual Question Answering (VQA) is of tremendous interest to the research community with important applications such as aiding visually impaired users and image-based search. In this work, we explore the use of scene graphs for solving the…

Computer Vision and Pattern Recognition · Computer Science 2021-01-19 Vinay Damodaran, Sharanya Chakravarthy, Akshay Kumar, Anjana Umapathy, Teruko Mitamura, Yuta Nakashima, Noa Garcia, Chenhui Chu

Advancing Chart Question Answering with Robust Chart Component Recognition

Chart comprehension presents significant challenges for machine learning models due to the diverse and intricate shapes of charts. Existing multimodal methods often overlook these visual features or fail to integrate them effectively for…

Computation and Language · Computer Science 2024-08-01 Hanwen Zheng, Sijia Wang, Chris Thomas, Lifu Huang

An Empirical Study on Leveraging Scene Graphs for Visual Question Answering

Visual question answering (Visual QA) has attracted significant attention these years. While a variety of algorithms have been proposed, most of them are built upon different combinations of image and language features as well as…

Computer Vision and Pattern Recognition · Computer Science 2019-07-30 Cheng Zhang, Wei-Lun Chao, Dong Xuan

SelfGraphVQA: A Self-Supervised Graph Neural Network for Scene-based Question Answering

The intersection of vision and language is of major interest due to the increased focus on seamless integration between recognition and reasoning. Scene graphs (SGs) have emerged as a useful tool for multimodal image analysis, showing…

Computer Vision and Pattern Recognition · Computer Science 2023-10-04 Bruno Souza, Marius Aasan, Helio Pedrini, Adín Ramírez Rivera

3D Question Answering for City Scene Understanding

3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments. While existing research has primarily focused on indoor household tasks…

Computer Vision and Pattern Recognition · Computer Science 2024-07-25 Penglei Sun, Yaoxian Song, Xiang Liu, Xiaofei Yang, Qiang Wang, Tiefeng Li, Yang Yang, Xiaowen Chu

InfoChartQA: A Benchmark for Multimodal Question Answering on Infographic Charts

Understanding infographic charts with design-driven visual elements (e.g., pictograms, icons) requires both visual recognition and reasoning, posing challenges for multimodal large language models (MLLMs). However, existing visual-question…

Computer Vision and Pattern Recognition · Computer Science 2025-10-30 Tianchi Xie, Minzhi Lin, Mengchen Liu, Yilin Ye, Changjian Chen, Shixia Liu

Multimodal grid features and cell pointers for Scene Text Visual Question Answering

This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present in it. The proposed model is based on…

Computer Vision and Pattern Recognition · Computer Science 2020-06-26 Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas

MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems

Multimodal Large Language Models (MLLMs) have demonstrated impressive abilities across various tasks, including visual question answering and chart comprehension, yet existing benchmarks for chart-related tasks fall short in capturing the…

Computation and Language · Computer Science 2025-02-11 Zifeng Zhu, Mengzhao Jia, Zhihan Zhang, Lang Li, Meng Jiang

Rethinking Comprehensive Benchmark for Chart Understanding: A Perspective from Scientific Literature

Scientific literature charts often contain complex visual elements, including multi-plot figures, flowcharts, structural diagrams, etc. Evaluating multimodal models using these authentic and intricate charts provides a more accurate…

Computation and Language · Computer Science 2024-12-18 Lingdong Shen, Qigqi, Kun Ding, Gaofeng Meng, Shiming Xiang

VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering

Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene;…

Computer Vision and Pattern Recognition · Computer Science 2023-09-18 Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, Jure Leskovec

Scene Graph Reasoning with Prior Visual Relationship for Visual Question Answering

One of the key issues of Visual Question Answering (VQA) is to reason with semantic clues in the visual content under the guidance of the question; how to model relational semantics remains a great challenge. To fully capture…

Multimedia · Computer Science 2019-08-22 Zhuoqian Yang, Zengchang Qin, Jing Yu, Yue Hu

Learning Conditioned Graph Structures for Interpretable Visual Question Answering

Visual Question answering is a challenging problem requiring a combination of concepts from Computer Vision and Natural Language Processing. Most existing approaches use a two streams strategy, computing image and question features that are…

Computer Vision and Pattern Recognition · Computer Science 2018-11-02 Will Norcliffe-Brown, Efstathios Vafeias, Sarah Parisot

Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

Charts are widely used to present complex information. Deriving meaningful insights in real-world contexts often requires interpreting multiple related charts together. Research on understanding multi-chart images has not been extensively…

Computation and Language · Computer Science 2026-04-24 Azher Ahmed Efat, Seok Hwan Song, Wallapak Tavanapong

Plugging Schema Graph into Multi-Table QA: A Human-Guided Framework for Reducing LLM Reliance

Large language models (LLMs) have shown promise in table Question Answering (Table QA). However, extending these capabilities to multi-table QA remains challenging due to unreliable schema linking across complex tables. Existing methods…

Artificial Intelligence · Computer Science 2025-11-25 Xixi Wang, Miguel Costa, Jordanka Kovaceva, Shuai Wang, Francisco C. Pereira

Joint learning of object graph and relation graph for visual question answering

Modeling visual question answering (VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which…

Computer Vision and Pattern Recognition · Computer Science 2022-05-10 Hao Li, Xu Li, Belhal Karimi, Jie Chen, Mingming Sun

Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text

Answering questions that require reading texts in an image is challenging for current models. One key difficulty of this task is that rare, polysemous, and ambiguous words frequently appear in images, e.g., names of places, products, and…

Computer Vision and Pattern Recognition · Computer Science 2020-04-01 Difei Gao, Ke Li, Ruiping Wang, Shiguang Shan, Xilin Chen

Mosaic of Modalities: A Comprehensive Benchmark for Multimodal Graph Learning

Graph machine learning has made significant strides in recent years, yet the integration of visual information with graph structure and its potential for improving performance in downstream tasks remains an underexplored area. To address…

Machine Learning · Computer Science 2025-04-01 Jing Zhu, Yuhang Zhou, Shengyi Qian, Zhongmou He, Tong Zhao, Neil Shah, Danai Koutra

Multimodal Graph Transformer for Multimodal Question Answering

Despite the success of Transformer models in vision and language tasks, they often learn knowledge from enormous data implicitly and cannot utilize structured input data directly. On the other hand, structured learning approaches such as…

Computer Vision and Pattern Recognition · Computer Science 2023-05-02 Xuehai He, Xin Eric Wang