Related papers: Exploring Sparse Spatial Relation in Graph Inferen…
One of the key issues of Visual Question Answering (VQA) is to reason with semantic clues in the visual content under the guidance of the question, how to model relational semantics still remains as a great challenge. To fully capture…
Over the past few years, a significant progress has been made in deep convolutional neural networks (CNNs)-based image recognition. This is mainly due to the strong ability of such networks in mining discriminative object pose and parts…
Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over underlying…
Scene graph generation (SGG) is to detect object pairs with their relations in an image. Existing SGG approaches often use multi-stage pipelines to decompose this task into object detection, relation graph construction, and dense or…
In order to answer semantically-complicated questions about an image, a Visual Question Answering (VQA) model needs to fully understand the visual scene in the image, especially the interactive dynamics between different objects. We propose…
Previous studies such as VizWiz find that Visual Question Answering (VQA) systems that can read and reason about text in images are useful in application areas such as assisting visually-impaired people. TextVQA is a VQA dataset geared…
The main challenge in video question answering (VideoQA) is to capture and understand the complex spatial and temporal relations between objects based on given questions. Existing graph-based methods for VideoQA usually ignore keywords in…
Text-based Visual Question Answering~(TextVQA) aims to produce correct answers for given questions about the images with multiple scene texts. In most cases, the texts naturally attach to the surface of the objects. Therefore, spatial…
Recently, graph neural networks (GNNs) have been widely used for document classification. However, most existing methods are based on static word co-occurrence graphs without sentence-level information, which poses three challenges:(1) word…
Visual commonsense reasoning (VCR) is a challenging multi-modal task, which requires high-level cognition and commonsense reasoning ability about the real world. In recent years, large-scale pre-training approaches have been developed and…
Outstanding achievements of graph neural networks for spatiotemporal time series analysis show that relational constraints introduce an effective inductive bias into neural forecasting architectures. Often, however, the relational…
For a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects are the important cue to understand the contextual information presented in the video. With the effective…
The paper addresses challenges in storing and retrieving sequences in contexts like anomaly detection, behavior prediction, and genetic information analysis. Associative Knowledge Graphs (AKGs) offer a promising approach by leveraging…
Comprehensive visual understanding requires detection frameworks that can effectively learn and utilize object interactions while analyzing objects individually. This is the main objective in Human-Object Interaction (HOI) detection task.…
Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches…
Modeling visual question answering(VQA) through scene graphs can significantly improve the reasoning accuracy and interpretability. However, existing models answer poorly for complex reasoning questions with attributes or relations, which…
Spatio-temporal forecasting of future values of spatially correlated time series is important across many cyber-physical systems (CPS). Recent studies offer evidence that the use of graph neural networks to capture latent correlations…
With the process of urbanization and the rapid growth of population, the issue of traffic congestion has become an increasingly critical concern. Intelligent transportation systems heavily rely on real-time and precise prediction algorithms…
Virtual sensing techniques allow for inferring signals at new unmonitored locations by exploiting spatio-temporal measurements coming from physical sensors at different locations. However, as the sensor coverage becomes sparse due to costs…
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless,…