Related papers: Length-Controllable Image Captioning

CLID: Controlled-Length Image Descriptions with Limited Data

Controllable image captioning models generate human-like image descriptions, enabling some kind of control over the generated captions. This paper focuses on controlling the caption length, i.e. a short and concise description or a long and…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Elad Hirsch, Ayellet Tal

Image Captioning with Deep Bidirectional LSTMs

This work presents an end-to-end trainable deep bidirectional LSTM (Long-Short Term Memory) model for image captioning. Our model builds on a deep convolutional neural network (CNN) and two separate LSTM networks. It is capable of learning…

Computer Vision and Pattern Recognition · Computer Science 2016-07-21 Cheng Wang, Haojin Yang, Christian Bartz, Christoph Meinel

Masked Non-Autoregressive Image Captioning

Existing captioning models often adopt the encoder-decoder architecture, where the decoder uses autoregressive decoding to generate captions, such that each token is generated sequentially given the preceding generated tokens. However,…

Computer Vision and Pattern Recognition · Computer Science 2019-06-04 Junlong Gao, Xi Meng, Shiqi Wang, Xia Li, Shanshe Wang, Siwei Ma, Wen Gao

Non-Autoregressive Coarse-to-Fine Video Captioning

It is encouraged to see that progress has been made to bridge videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer…

Computer Vision and Pattern Recognition · Computer Science 2021-03-25 Bang Yang, Yuexian Zou, Fenglin Liu, Can Zhang

Controllable Image Captioning via Prompting

Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li

Semi-Autoregressive Transformer for Image Captioning

Current state-of-the-art image captioning models adopt autoregressive decoders, i.e., they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. To tackle this issue,…

Computer Vision and Pattern Recognition · Computer Science 2021-08-18 Yuanen Zhou, Yong Zhang, Zhenzhen Hu, Meng Wang

Controllable Image Captioning

State-of-the-art image captioners can generate accurate sentences to describe images in a sequence to sequence manner without considering the controllability and interpretability. This, however, is far from making image captioning widely…

Computer Vision and Pattern Recognition · Computer Science 2022-05-26 Luka Maxwell

Fast Image Caption Generation with Position Alignment

Recent neural network models for image captioning usually employ an encoder-decoder architecture, where the decoder adopts a recursive sequence decoding way. However, such autoregressive decoding may result in sequential error accumulation…

Computer Vision and Pattern Recognition · Computer Science 2019-12-16 Zheng-cong Fei

Semi-Autoregressive Image Captioning

Current state-of-the-art approaches for image captioning typically adopt an autoregressive manner, i.e., generating descriptions word by word, which suffers from slow decoding issue and becomes a bottleneck in real-time applications.…

Computer Vision and Pattern Recognition · Computer Science 2021-10-14 Xu Yan, Zhengcong Fei, Zekang Li, Shuhui Wang, Qingming Huang, Qi Tian

Macroscopic Control of Text Generation for Image Captioning

Despite the fact that image captioning models have been able to generate impressive descriptions for a given image, challenges remain: (1) the controllability and diversity of existing models are still far from satisfactory; (2) models…

Computer Vision and Pattern Recognition · Computer Science 2021-01-21 Zhangzi Zhu, Tianlei Wang, Hong Qu

Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion

State-of-the-art (SoTA) image captioning models are often trained on the Microsoft Common Objects in Context (MS-COCO) dataset, which contains human-annotated captions with an average length of approximately ten tokens. Although effective…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Luigi Celona, Simone Bianco, Marco Donzella, Paolo Napoletano

Fine-grained length controllable video captioning with ordinal embeddings

This paper proposes a method for video captioning that controls the length of generated captions. Previous work on length control often had few levels for expressing length. In this study, we propose two methods of length embedding for…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Tomoya Nitta, Takumi Fukuzawa, Toru Tamaki

Exploring Discrete Diffusion Models for Image Captioning

The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed the name DDCap, to allow more decoding flexibility. Unlike image…

Computer Vision and Pattern Recognition · Computer Science 2022-12-12 Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu

Generating an Image From 1,000 Words: Enhancing Text-to-Image With Structured Captions

Text-to-image models have rapidly evolved from casual creative tools to professional-grade systems, achieving unprecedented levels of image quality and realism. Yet, most models are trained to map short prompts into detailed images,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-11 Eyal Gutflaish, Eliran Kachlon, Hezi Zisman, Tal Hacham, Nimrod Sarid, Alexander Visheratin, Saar Huberman, Gal Davidi, Guy Bukchin, Kfir Goldberg, Ron Mokady

COMIC: Towards A Compact Image Captioning Model with Attention

Recent works in image captioning have shown very promising raw performance. However, we realize that most of these encoder-decoder style networks with attention do not scale naturally to large vocabulary size, making them difficult to be…

Computer Vision and Pattern Recognition · Computer Science 2019-06-13 Jia Huei Tan, Chee Seng Chan, Joon Huang Chuah

SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation

Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Rita Ramos, Bruno Martins, Desmond Elliott, Yova Kementchedjhieva

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning

Generating textual descriptions for images has been an attractive problem for the computer vision and natural language processing researchers in recent years. Dozens of models based on deep learning have been proposed to solve this problem.…

Computer Vision and Pattern Recognition · Computer Science 2019-07-01 Ahmad Asadi, Reza Safabakhsh

Convolutional Image Captioning

Image captioning is an important but challenging task, applicable to virtual assistants, editing tools, image indexing, and support of the disabled. Its challenges are due to the variability and ambiguity of possible image descriptions. In…

Computer Vision and Pattern Recognition · Computer Science 2017-11-28 Jyoti Aneja, Aditya Deshpande, Alexander Schwing

Image Captioning using Deep Stacked LSTMs, Contextual Word Embeddings and Data Augmentation

Image Captioning, or the automatic generation of descriptions for images, is one of the core problems in Computer Vision and has seen considerable progress using Deep Learning Techniques. We propose to use Inception-ResNet Convolutional…

Computer Vision and Pattern Recognition · Computer Science 2021-02-23 Sulabh Katiyar, Samir Kumar Borgohain

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images…

Computer Vision and Pattern Recognition · Computer Science 2019-08-27 Iro Laina, Christian Rupprecht, Nassir Navab