Related papers

Related papers: 3M: Multi-style image caption generation using Mul…

200 papers

In this paper, we propose to build a stylish image captioning model through a Multi-style Multi modality mechanism (2M). We demonstrate that with 2M, we can build an effective stylish captioner and that multi-references produced by the…

Computation and Language · Computer Science 2021-11-19 Chengxi Li, Brent Harrison

Multilingual image captioning has recently been tackled by training with large-scale machine translated data, which is an expensive, noisy, and time-consuming process. Without requiring any multilingual caption data, we propose LMCap, an…

Computation and Language · Computer Science 2023-06-01 Rita Ramos, Bruno Martins, Desmond Elliott

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This…

Computer Vision and Pattern Recognition · Computer Science 2021-03-01 Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them…

Computer Vision and Pattern Recognition · Computer Science 2026-01-28 Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, Jingdong Wang

This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed,…

Computer Vision and Pattern Recognition · Computer Science 2023-12-01 Marcella Cornia, Lorenzo Baraldi, Giuseppe Fiameni, Rita Cucchiara

The image captioning task has been extensively researched in previous work. However, few experiments focus on generating captions with a non-autoregressive text decoder. Inspired by the recent success of the denoising diffusion model on…

Computer Vision and Pattern Recognition · Computer Science 2022-10-11 Shitong Xu

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are…

Computer Vision and Pattern Recognition · Computer Science 2015-06-12 Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, Zhiheng Huang, Alan Yuille

Figure captions are crucial for helping readers understand and remember a figure's key message. Many models have been developed to generate these captions, helping authors compose better quality captions more easily. Yet, authors almost…

Existing image captioning systems are dedicated to generating narrative captions for images, which are spatially detached from the image in presentation. However, texts can also be used as decorations on the image to highlight the key…

Computer Vision and Pattern Recognition · Computer Science 2022-04-28 Yiqi Gao, Xinglin Hou, Yuanmeng Zhang, Tiezheng Ge, Yuning Jiang, Peng Wang

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based…

Computer Vision and Pattern Recognition · Computer Science 2019-05-21 Jun Yu, Jing Li, Zhou Yu, Qingming Huang

Multimodal Large Language Models (MLLMs) demonstrate a complex understanding of scenes, benefiting from large-scale and high-quality datasets. Most existing caption datasets lack the ground locations and relations for visual entities.…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Xiangtai Li, Tao Zhang, Yanwei Li, Haobo Yuan, Shihao Chen, Yikang Zhou, Jiahao Meng, Yueyi Sun, Shilin Xu, Lu Qi, Tianheng Cheng, Yi Lin, Zilong Huang, Wenhao Huang, Jiashi Feng, Guang Shi

Scientific figure captioning is a complex task that requires generating contextually appropriate descriptions of visual content. However, existing methods often fall short by utilizing incomplete information, treating the task solely as…

Automatically generating descriptive captions for images is a well-researched area in computer vision. However, existing evaluation approaches focus on measuring the similarity between two sentences disregarding fine-grained semantics of…

Computer Vision and Pattern Recognition · Computer Science 2019-08-07 Philipp Harzig, Dan Zecha, Rainer Lienhart, Carolin Kaiser, René Schallner

Humor, deeply rooted in societal meanings and cultural details, poses a unique challenge for machines. While advances have been made in natural language processing, real-world humor often thrives in a multi-modal context, encapsulated…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Yuyan Chen, Songzhou Yan, Zhihong Zhu, Zhixu Li, Yanghua Xiao

Image Captioning is a task that requires models to acquire a multi-modal understanding of the world and to express this understanding in natural language text. While the state-of-the-art for this task has rapidly improved in terms of n-gram…

Computer Vision and Pattern Recognition · Computer Science 2018-12-20 Annika Lindh, Robert J. Ross, Abhijit Mahalunkar, Giancarlo Salton, John D. Kelleher

Large pre-trained multimodal models have demonstrated significant success in a range of downstream tasks, including image captioning, image-text retrieval, visual question answering (VQA), etc. However, many of these methods rely on…

Computer Vision and Pattern Recognition · Computer Science 2023-08-08 Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu

Research on generative models to produce human-aligned / human-preferred outputs has seen significant recent contributions. Between text and image-generative models, we narrowed our focus to text-based generative models, particularly to…

Computer Vision and Pattern Recognition · Computer Science 2024-03-12 Adarsh N L, Arun P, Aravindh N L

The recent progress on image recognition and language modeling is making automatic description of image content a reality. However, stylized, non-factual aspects of the written description are missing from the current systems. One such…

Computer Vision and Pattern Recognition · Computer Science 2015-12-15 Alexander Mathews, Lexing Xie, Xuming He

With great advances in vision and natural language processing, the generation of image captions has become a need. In a recent paper, Mathews, Xie and He [1] extended a new model to generate styled captions by separating semantics and style.…

Computer Vision and Pattern Recognition · Computer Science 2022-02-03 Marzieh Heidari, Mehdi Ghatee, Ahmad Nickabadi, Arash Pourhasan Nezhad

Video captioning is, in essence, a complex natural process, which is affected by various uncertainties stemming from video content, subjective judgment, etc. In this paper we build on the recent progress in using encoder-decoder framework…

Computer Vision and Pattern Recognition · Computer Science 2017-10-23 Jingkuan Song, Yuyu Guo, Lianli Gao, Xuelong Li, Alan Hanjalic, Heng Tao Shen