Related papers: Structured Visual Search via Composition-aware Lea…
As the intermediate-level representations bridging the two levels, structured representations of visual scenes, such as visual relationships between pairwise objects, have been shown to not only benefit compositional models in learning to…
A structured query can capture the complexity of object interactions (e.g. 'woman rides motorcycle') unlike single objects (e.g. 'woman' or 'motorcycle'). Retrieval using structured queries therefore is much more useful than single object…
With the wide usage of data visualizations, a huge number of Scalable Vector Graphic (SVG)-based visualizations have been created and shared online. Accordingly, there has been an increasing interest in exploring how to retrieve…
Humans are highly efficient learners, with the ability to grasp the meaning of a new concept from just a few examples. Unlike popular computer vision systems, humans can flexibly leverage the compositional structure of the visual world,…
Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art…
Visual localization is the problem of estimating the camera pose of a given query image within a known scene. Most state-of-the-art localization approaches follow the structure-based paradigm and use 2D-3D matches between pixels in a query…
Efficient attention deployment in visual search is limited by human visual memory, yet this limitation can be offset by exploiting the environment's structure. This paper introduces a computational cognitive model that simulates how the…
We propose a computationally efficient and high-performance classification algorithm by incorporating class structural information in analysis dictionary learning. To achieve more consistent classification, we associate a class…
We present an algorithm for searching image collections using free-hand sketches that describe the appearance and relative positions of multiple objects. Sketch based image retrieval (SBIR) methods predominantly match queries containing a…
Compositional understanding is crucial for human intelligence, yet it remains unclear whether contemporary vision models exhibit it. The dominant machine learning paradigm is built on the premise that scaling data and model sizes will…
Numerous embedding models have been recently explored to incorporate semantic knowledge into visual recognition. Existing methods typically focus on minimizing the distance between the corresponding images and texts in the embedding space…
Hierarchies allow feature sharing between objects at multiple levels of representation, can code exponential variability in a very compact way and enable fast inference. This makes them potentially suitable for learning and recognizing a…
To identify the location of objects of a particular class, a passive computer vision system generally processes all the regions in an image to finally output few regions. However, we can use structure in the scene to search for objects…
People easily recognize new visual categories that are new combinations of known components. This compositional generalization capacity is critical for learning in real-world domains like vision and language because the long tail of new…
Recognizing visual entities in a natural language sentence and arranging them in a 2D spatial layout require a compositional understanding of language and space. This task of layout prediction is valuable in text-to-image synthesis as it…
Learning compositional representation is a key aspect of object-centric learning as it enables flexible systematic generalization and supports complex visual reasoning. However, most of the existing approaches rely on auto-encoding…
We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary.…
We propose a novel approach to improve a visual-semantic embedding model by incorporating concept representations captured from an external structured knowledge base. We investigate its performance on image classification under both…
Metric learning seeks to embed images of objects suchthat class-defined relations are captured by the embeddingspace. However, variability in images is not just due to different depicted object classes, but also depends on other latent…
In this vision paper, we propose a shift in perspective for improving the effectiveness of similarity search. Rather than focusing solely on enhancing the data quality, particularly machine learning-generated embeddings, we advocate for a…