Related papers: Deep image representations using caption generator…
Automated image captioning is one of the applications of Deep Learning which involves fusion of work done in computer vision and natural language processing, and it is typically performed using Encoder-Decoder architectures. In this…
Most of the approaches for discovering visual attributes in images demand significant supervision, which is cumbersome to obtain. In this paper, we aim to discover visual attributes in a weakly supervised setting that is commonly…
The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such…
Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images,…
Scene labeling is a challenging classification problem where each input image requires a pixel-level prediction map. Recently, deep-learning-based methods have shown their effectiveness on solving this problem. However, we argue that the…
This research explores the realm of neural image captioning using deep learning models. The study investigates the performance of different neural architecture configurations, focusing on the inject architecture, and proposes a novel…
Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e. describing images with syntactically and semantically meaningful…
Recently, intermediate feature maps of pre-trained convolutional neural networks have shown significant perceptual quality improvements, when they are used in the loss function for training new networks. It is believed that these features…
While advanced image captioning systems are increasingly describing images coherently and exactly, recent progress in continual learning allows deep learning models to avoid catastrophic forgetting. However, the domain where image…
Deep learning is found to be vulnerable to adversarial examples. However, its adversarial susceptibility in image caption generation is under-explored. We study adversarial examples for vision and language models, which typically adopt an…
Recently it has shown that the policy-gradient methods for reinforcement learning have been utilized to train deep end-to-end systems on natural language processing tasks. What's more, with the complexity of understanding image content and…
While image-text representation learning has become very popular in recent years, existing models tend to lack spatial awareness and have limited direct applicability for dense understanding tasks. For this reason, self-supervised…
Scene parsing is an important and challenging prob- lem in computer vision. It requires labeling each pixel in an image with the category it belongs to. Tradition- ally, it has been approached with hand-engineered features from color…
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions…
Semantic labeling (or pixel-level land-cover classification) in ultra-high resolution imagery (< 10cm) requires statistical models able to learn high level concepts from spatial data, with large appearance variations. Convolutional Neural…
Image captioning is a research area of immense importance, aiming to generate natural language descriptions for visual content in the form of still images. The advent of deep learning and more recently vision-language pre-training…
In recent years Convolutional neural networks (CNN) have made significant progress in computer vision. These advancements have been applied to other areas, such as remote sensing and have shown satisfactory results. However, the lack of…
Generating a description of an image is called image captioning. Image captioning requires to recognize the important objects, their attributes and their relationships in an image. It also needs to generate syntactically and semantically…
Deep neural networks have achieved great successes on the image captioning task. However, most of the existing models depend heavily on paired image-sentence datasets, which are very expensive to acquire. In this paper, we make the first…
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events and that these simple events are shared across several complex events. We first employ a…