Related papers: 3M: Multi-style image caption generation using Mul…

A Self-Explainable Stylish Image Captioning Framework via Multi-References

In this paper, we propose to build a stylish image captioning model through a Multi-style Multi modality mechanism (2M). We demonstrate that with 2M, we can build an effective stylish captioner and that multi-references produced by the…

Computation and Language · Computer Science · 2021-11-19 · Chengxi Li, Brent Harrison

LMCap: Few-shot Multilingual Image Captioning by Retrieval Augmented Language Model Prompting

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an…

Computation and Language · Computer Science · 2023-06-01 · Rita Ramos, Bruno Martins, Desmond Elliott

Fusion Models for Improved Visual Captioning

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This…

Computer Vision and Pattern Recognition · Computer Science · 2021-03-01 · Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-28 · Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, Jingdong Wang

Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed,…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-01 · Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara

CLIP-Diffusion-LM: Apply Diffusion Model on Image Captioning

The image captioning task has been extensively researched in previous work. However, few experiments focus on generating captions with a non-autoregressive text decoder. Inspired by the recent success of the denoising diffusion model on…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-11 · Shitong Xu

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are…

Computer Vision and Pattern Recognition · Computer Science · 2015-06-12 · Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille

LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles

Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost…

Computation and Language · Computer Science · 2025-09-24 · Ho Yin 'Sam' Ng, Ting-Yao Hsu, Aashish Anantha Ramakrishnan, Branislav Kveton, Nedim Lipka, Franck Dernoncourt, Dongwon Lee, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ting-Hao 'Kenneth' Huang

CapOnImage: Context-driven Dense-Captioning on Image

Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from the image in presentation. However, texts can also be used as decorations on the image to highlight the key…

Computer Vision and Pattern Recognition · Computer Science · 2022-04-28 · Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, Peng Wang

Multimodal Transformer with Multi-View Visual Representation for Image Captioning

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based…

Computer Vision and Pattern Recognition · Computer Science · 2019-05-21 · Jun Yu, Jing Li, Zhou Yu, Qingming Huang

DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World

Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities.…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-01 · Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi

Multi-LLM Collaborative Caption Generation in Scientific Documents

Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as…

Computation and Language · Computer Science · 2025-01-07 · Jaeyoung Kim, Jongho Lee, Hong-Jun Choi, Ting-Yao Hsu, Chieh-Yang Huang, Sungchul Kim, Ryan Rossi, Tong Yu, Clyde Lee Giles, Ting-Hao 'Kenneth' Huang, Sungchul Choi

Image Captioning with Clause-Focused Metrics in a Multi-Modal Setting for Marketing

Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences disregarding fine-grained semantics of…

Computer Vision and Pattern Recognition · Computer Science · 2019-08-07 · Philipp Harzig, Dan Zecha, Rainer Lienhart, Carolin Kaiser, René Schallner

XMeCap: Meme Caption Generation with Sub-Image Adaptability

Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-12 · Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao

Generating Diverse and Meaningful Captions

Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram…

Computer Vision and Pattern Recognition · Computer Science · 2018-12-20 · Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton, John D. Kelleher

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, visual question answering (VQA), etc. However, many of these methods rely on…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-08 · Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu

Enhancing Image Caption Generation Using Reinforcement Learning with Human Feedback

Research on generative models to produce human-aligned / human-preferred outputs has seen significant recent contributions. Between text and image-generative models, we narrowed our focus to text-based generative models, particularly to…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-12 · Adarsh N L, Arun P, Aravindh N L

SentiCap: Generating Image Descriptions with Sentiments

The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such…

Computer Vision and Pattern Recognition · Computer Science · 2015-12-15 · Alexander Mathews, Lexing Xie, Xuming He

Diverse and Styled Image Captioning Using SVD-Based Mixture of Recurrent Experts

With great advances in vision and natural language processing, the generation of image captions becomes a need. In a recent paper, Mathews, Xie and He [1], extended a new model to generate styled captions by separating semantics and style.…

Computer Vision and Pattern Recognition · Computer Science · 2022-02-03 · Marzieh Heidari, Mehdi Ghatee, Ahmad Nickabadi, Arash Pourhasan Nezhad

From Deterministic to Generative: Multi-Modal Stochastic RNNs for Video Captioning

Video captioning is in essence a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using encoder-decoder framework…

Computer Vision and Pattern Recognition · Computer Science · 2017-10-23 · Jingkuan Song, Yuyu Guo, Lianli Gao, Xuelong Li, Alan Hanjalic, Heng Tao Shen