Related papers: Retrieval-augmented Image Captioning

Retrieval-Augmented Transformer for Image Captioning

Image captioning models aim at connecting Vision and Language by providing natural language descriptions of input images. In the past few years, the task has been tackled by learning parametric models and proposing visual feature extraction…

Computer Vision and Pattern Recognition · Computer Science 2022-08-23 Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Towards Retrieval-Augmented Architectures for Image Captioning

The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have…

Computer Vision and Pattern Recognition · Computer Science 2024-05-24 Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

CONCAP: Seeing Beyond English with Concepts Retrieval-Augmented Captioning

Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and costly large-scale model parameterization.…

Computation and Language · Computer Science 2025-07-29 George Ibrahim, Rita Ramos, Yova Kementchedjhieva

Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data

The aim of image captioning is to generate captions by machine to describe image contents. Despite many efforts, generating discriminative captions for images remains non-trivial. Most traditional approaches imitate the language structure…

Computer Vision and Pattern Recognition · Computer Science 2018-07-24 Xihui Liu, Hongsheng Li, Jing Shao, Dapeng Chen, Xiaogang Wang

EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension

Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently, necessitating the requirement of sustaining up-to-date object…

Computer Vision and Pattern Recognition · Computer Science 2024-04-09 Jiaxuan Li, Duc Minh Vo, Akihiro Sugimoto, Hideki Nakayama

Fusion Models for Improved Visual Captioning

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This…

Computer Vision and Pattern Recognition · Computer Science 2021-03-01 Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Multilingual Training-Free Remote Sensing Image Captioning

Remote sensing image captioning has advanced rapidly through encoder-decoder models, although the reliance on large annotated datasets and the focus on English restricts global applicability. To address these limitations, we propose the…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Carlos Rebelo, Gil Rocha, João Daniel Silva, Bruno Martins

A Semi-supervised Framework for Image Captioning

State-of-the-art approaches for image captioning require supervised training data consisting of captions with paired image data. These methods are typically unable to use unsupervised data such as textual data with no corresponding images,…

Computer Vision and Pattern Recognition · Computer Science 2017-06-27 Wenhu Chen, Aurelien Lucchi, Thomas Hofmann

LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an…

Computation and Language · Computer Science 2023-06-01 Rita Ramos, Bruno Martins, Desmond Elliott

Image Captioning based on Feature Refinement and Reflective Decoding

Image captioning is the process of automatically generating a description of an image in natural language. Image captioning is one of the significant challenges in image understanding since it requires not only recognizing salient objects…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Ghadah Alabduljabbar, Hafida Benhidour, Said Kerrache

XGPT: Cross-modal Generative Pre-Training for Image Captioning

While many BERT-based cross-modal pre-trained models produce excellent results on downstream understanding tasks like image-text retrieval and VQA, they cannot be applied to generation tasks directly. In this paper, we propose XGPT, a new…

Computation and Language · Computer Science 2020-03-05 Qiaolin Xia, Haoyang Huang, Nan Duan, Dongdong Zhang, Lei Ji, Zhifang Sui, Edward Cui, Taroon Bharti, Xin Liu, Ming Zhou

Reflective Decoding Network for Image Captioning

State-of-the-art image captioning methods mostly focus on improving visual features; less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary…

Computer Vision and Pattern Recognition · Computer Science 2019-09-02 Lei Ke, Wenjie Pei, Ruiyu Li, Xiaoyong Shen, Yu-Wing Tai

Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance…

Computer Vision and Pattern Recognition · Computer Science 2017-04-14 Zhou Ren, Xiaoyu Wang, Ning Zhang, Xutao Lv, Li-Jia Li

Cross-Modal Retrieval Augmentation for Multi-Modal Classification

Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge…

Computer Vision and Pattern Recognition · Computer Science 2021-04-19 Shir Gur, Natalia Neverova, Chris Stauffer, Ser-Nam Lim, Douwe Kiela, Austin Reiter

Exploring Annotation-free Image Captioning with Retrieval-augmented Pseudo Sentence Generation

Recently, training an image captioner without annotated image-sentence pairs has gained traction. Previous methods have faced limitations due to either using mismatched corpora for inaccurate pseudo annotations or relying on…

Computer Vision and Pattern Recognition · Computer Science 2024-10-15 Zhiyuan Li, Dongnan Liu, Heng Wang, Chaoyi Zhang, Weidong Cai

Vector Learning for Cross Domain Representations

Recently, generative adversarial networks have gained a lot of popularity for image generation tasks. However, such models are associated with complex learning mechanisms and demand very large relevant datasets. This work borrows concepts…

Machine Learning · Computer Science 2018-09-28 Shagan Sah, Chi Zhang, Thang Nguyen, Dheeraj Kumar Peri, Ameya Shringi, Raymond Ptucha

Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Image Captioning, or the automatic generation of descriptions for images, is one of the core problems in Computer Vision and has seen considerable progress using Deep Learning Techniques. We propose to use Inception-ResNet Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-02-23 Sulabh Katiyar, Samir Kumar Borgohain

Injecting Prior Knowledge into Image Caption Generation

Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. The…

Computation and Language · Computer Science 2020-08-07 Arushi Goel, Basura Fernando, Thanh-Son Nguyen, Hakan Bilen

VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

The ability to quickly learn from a small quantity of training data widens the range of machine learning applications. In this paper, we propose a data-efficient image captioning model, VisualGPT, which leverages the linguistic knowledge…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Jun Chen, Han Guo, Kai Yi, Boyang Li, Mohamed Elhoseiny

Stylized image captioning systems aim to generate a caption not only semantically related to a given image but also consistent with a given style description. One of the biggest challenges with this task is the lack of sufficient paired…

Computer Vision and Pattern Recognition · Computer Science 2021-08-27 Guodun Li, Yuchen Zhai, Zehao Lin, Yin Zhang