Related papers: Dependent Multi-Task Learning with Causal Interven…

Image Captioning based on Deep Reinforcement Learning

Recently it has shown that the policy-gradient methods for reinforcement learning have been utilized to train deep end-to-end systems on natural language processing tasks. What's more, with the complexity of understanding image content and…

Computer Vision and Pattern Recognition · Computer Science 2018-09-14 Haichao Shi , Peng Li , Bo Wang , Zhenyu Wang

Masked Diffusion Captioning for Visual Feature Learning

We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a…

Computer Vision and Pattern Recognition · Computer Science 2025-10-31 Chao Feng , Zihao Wei , Andrew Owens

Deep Reinforcement Learning-based Image Captioning with Embedding Reward

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance…

Computer Vision and Pattern Recognition · Computer Science 2017-04-14 Zhou Ren , Xiaoyu Wang , Ning Zhang , Xutao Lv , Li-Jia Li

Improving Image Captioning with Better Use of Captions

Image captioning is a multimodal problem that has drawn extensive attention in both the natural language processing and computer vision community. In this paper, we present a novel image captioning architecture to better explore semantics…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhan Shi , Xu Zhou , Xipeng Qiu , Xiaodan Zhu

Image Difference Captioning with Pre-training and Contrastive Learning

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require…

Multimedia · Computer Science 2022-02-10 Linli Yao , Weiying Wang , Qin Jin

ContCap: A scalable framework for continual image captioning

While advanced image captioning systems are increasingly describing images coherently and exactly, recent progress in continual learning allows deep learning models to avoid catastrophic forgetting. However, the domain where image…

Computer Vision and Pattern Recognition · Computer Science 2020-04-22 Giang Nguyen , Tae Joon Jun , Trung Tran , Tolcha Yalew , Daeyoung Kim

Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning

Observing a set of images and their corresponding paragraph-captions, a challenging task is to learn how to produce a semantically coherent paragraph to describe the visual content of an image. Inspired by recent successes in integrating…

Computer Vision and Pattern Recognition · Computer Science 2022-07-27 Dandan Guo , Ruiying Lu , Bo Chen , Zequn Zeng , Mingyuan Zhou

Learning Distinct and Representative Styles for Image Captioning

Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward…

Computer Vision and Pattern Recognition · Computer Science 2023-08-16 Qi Chen , Chaorui Deng , Qi Wu

Implicit and Explicit Commonsense for Multi-sentence Video Captioning

Existing dense or paragraph video captioning approaches rely on holistic representations of videos, possibly coupled with learned object/action representations, to condition hierarchical language decoders. However, they fundamentally lack…

Computer Vision and Pattern Recognition · Computer Science 2024-01-10 Shih-Han Chou , James J. Little , Leonid Sigal

Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception

Training Large Multimodality Models (LMMs) relies on descriptive image caption that connects image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them…

Computer Vision and Pattern Recognition · Computer Science 2026-01-28 Yanpeng Sun , Jing Hao , Ke Zhu , Jiang-Jiang Liu , Yuxiang Zhao , Xiaofan Li , Na Zhao , Zechao Li , Jingdong Wang

RATT: Recurrent Attention to Transient Tasks for Continual Image Captioning

Research on continual learning has led to a variety of approaches to mitigating catastrophic forgetting in feed-forward classification networks. Until now surprisingly little attention has been focused on continual learning of recurrent…

Computer Vision and Pattern Recognition · Computer Science 2020-10-30 Riccardo Del Chiaro , Bartłomiej Twardowski , Andrew D. Bagdanov , Joost van de Weijer

Image Captioning with Clause-Focused Metrics in a Multi-Modal Setting for Marketing

Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences disregarding fine-grained semantics of…

Computer Vision and Pattern Recognition · Computer Science 2019-08-07 Philipp Harzig , Dan Zecha , Rainer Lienhart , Carolin Kaiser , René Schallner

LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an…

Computation and Language · Computer Science 2023-06-01 Rita Ramos , Bruno Martins , Desmond Elliott

CapText: Large Language Model-based Caption Generation From Image Context and Description

While deep-learning models have been shown to perform well on image-to-text datasets, it is difficult to use them in practice for captioning images. This is because captions traditionally tend to be context-dependent and offer complementary…

Machine Learning · Computer Science 2023-06-07 Shinjini Ghosh , Sagnik Anupam

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

The existing image captioning approaches typically train a one-stage sentence decoder, which is difficult to generate rich fine-grained descriptions. On the other hand, multi-stage image caption model is hard to train due to the vanishing…

Computer Vision and Pattern Recognition · Computer Science 2018-03-15 Jiuxiang Gu , Jianfei Cai , Gang Wang , Tsuhan Chen

Patch Matters: Training-free Fine-grained Image Caption Enhancement via Local Perception

High-quality image captions play a crucial role in improving the performance of cross-modal applications such as text-to-image generation, text-to-video generation, and text-image retrieval. To generate long-form, high-quality captions,…

Computer Vision and Pattern Recognition · Computer Science 2025-04-10 Ruotian Peng , Haiying He , Yake Wei , Yandong Wen , Di Hu

A Comprehensive Survey of Deep Learning for Image Captioning

Generating a description of an image is called image captioning. Image captioning requires to recognize the important objects, their attributes and their relationships in an image. It also needs to generate syntactically and semantically…

Computer Vision and Pattern Recognition · Computer Science 2018-10-16 Md. Zakir Hossain , Ferdous Sohel , Mohd Fairuz Shiratuddin , Hamid Laga

Partially-Supervised Image Captioning

Image captioning models are becoming increasingly successful at describing the content of images in restricted domains. However, if these models are to function in the wild - for example, as assistants for people with impaired vision - a…

Computer Vision and Pattern Recognition · Computer Science 2018-11-29 Peter Anderson , Stephen Gould , Mark Johnson

CapOnImage: Context-driven Dense-Captioning on Image

Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from the image in presentation. However, texts can also be used as decorations on the image to highlight the key…

Computer Vision and Pattern Recognition · Computer Science 2022-04-28 Yiqi Gao , Xinglin Hou , Yuanmeng Zhang , Tiezheng Ge , Yuning Jiang , Peng Wang

Injecting Prior Knowledge into Image Caption Generation

Automatically generating natural language descriptions from an image is a challenging problem in artificial intelligence that requires a good understanding of the visual and textual signals and the correlations between them. The…

Computation and Language · Computer Science 2020-08-07 Arushi Goel , Basura Fernando , Thanh-Son Nguyen , Hakan Bilen