Related papers: Learning to Compose Dynamic Tree Structures for Vi…
Understanding realistic visual scene images together with language descriptions is a fundamental task towards generic visual understanding. Previous works have shown compelling comprehensive results by building hierarchical structures for…
Reasoning about images/objects and their hierarchical interactions is a key concept for the next generation of computer vision approaches. Here we present a new framework to deal with it through a visual hierarchical context-based…
A Concept Tree is a structure for storing knowledge where the trees are stored in a database called a Concept Base. It sits between the highly distributed neural architectures and the distributed information systems, with the intention of…
State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual…
Relations amongst entities play a central role in image understanding. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also…
As computer vision continues to advance and finds widespread applications across various domains, the need for interpretability in deep learning models becomes paramount. Existing methods often resort to post-hoc techniques or prototypes to…
In this paper, we explore the potential of visual in-context learning to enable a single model to handle multiple tasks and adapt to new tasks during test time without re-training. Unlike previous approaches, our focus is on training…
Visual Commonsense Reasoning (VCR) predicts an answer with corresponding rationale, given a question-image input. VCR is a recently introduced visual scene understanding task with a wide range of applications, including visual question…
While much research has been done in text-to-image synthesis, little work has been done to explore the usage of linguistic structure of the input text. Such information is even more important for story visualization since its inputs have an…
The visual world around us can be described as a structured set of objects and their associated relations. An image of a room may be conjured given only the description of the underlying objects and their associated relations. While there…
As the intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairwise objects, have been shown to not only benefit compositional models in learning to…
Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can…
The use of Large Language Models (LLMs) for generating Behavior Trees (BTs) has recently gained attention in the robotics community, yet remains in its early stages of development. In this paper, we propose a novel framework that leverages…
Understanding a visual scene incorporates objects, relationships, and context. Traditional methods working on an image mostly focus on object detection and fail to capture the relationship between the objects. Relationships can give rich…
This thesis tackles the problem of learning efficient representations of complex, structured data with a natural application to web page and element classification. We hypothesise that the context around the element inside the web page is…
This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods…
We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a…
Context plays an important role in visual recognition. Recent studies have shown that visual recognition networks can be fooled by placing objects in inconsistent contexts (e.g., a cow in the ocean). To model the role of contextual…
Objects in a scene are not always related. The execution efficiency of the one-stage scene graph generation approaches are quite high, which infer the effective relation between entity pairs using sparse proposal sets and a few queries.…
Exposing latent knowledge in geospatial trajectories has the potential to provide a better understanding of the movements of individuals and groups. Motivated by such a desire, this work presents the context tree, a new hierarchical data…