Related papers: Masked Diffusion Captioning for Visual Feature Lea…

Image Difference Captioning with Pre-training and Contrastive Learning

The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language. The major challenges of this task lie in two aspects: 1) fine-grained visual differences that require…

Multimedia · Computer Science 2022-02-10 Linli Yao, Weiying Wang, Qin Jin

Learning Visual Representations with Caption Annotations

Pretraining general-purpose visual features has become a crucial part of tackling many computer vision tasks. While one can learn such features on the extensively-annotated ImageNet dataset, recent approaches have looked at ways to allow…

Computer Vision and Pattern Recognition · Computer Science 2020-08-05 Mert Bulent Sariyildiz, Julien Perez, Diane Larlus

Cognitive resilience: Unraveling the proficiency of image-captioning models to interpret masked visual content

This study explores the ability of Image Captioning (IC) models to decode masked visual content sourced from diverse datasets. Our findings reveal the IC model's capability to generate captions from masked images, closely resembling the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Zhicheng Du, Zhaotian Xie, Huazhang Ying, Likun Zhang, Peiwu Qin

Enhancing Descriptive Captions with Visual Attributes for Multimodal Perception

Training Large Multimodality Models (LMMs) relies on descriptive image caption that connects image and language. Existing methods for generating such captions often rely on distilling the captions from pretrained LMMs, constructing them…

Computer Vision and Pattern Recognition · Computer Science 2026-01-28 Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Na Zhao, Zechao Li, Jingdong Wang

DiffCap: Exploring Continuous Diffusion on Image Captioning

Current image captioning works usually focus on generating descriptions in an autoregressive manner. However, there are limited works that focus on generating descriptions non-autoregressively, which brings more decoding diversity. Inspired…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Yufeng He, Zefan Cai, Xu Gan, Baobao Chang

Exploring Discrete Diffusion Models for Image Captioning

The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed the name DDCap, to allow more decoding flexibility. Unlike image…

Computer Vision and Pattern Recognition · Computer Science 2022-12-12 Zixin Zhu, Yixuan Wei, Jianfeng Wang, Zhe Gan, Zheng Zhang, Le Wang, Gang Hua, Lijuan Wang, Zicheng Liu, Han Hu

Enhanced Modality Transition for Image Captioning

Image captioning model is a cross-modality knowledge discovery task, which targets at automatically describing an image with an informative and coherent sentence. To generate the captions, the previous encoder-decoder frameworks directly…

Computer Vision and Pattern Recognition · Computer Science 2021-02-24 Ziwei Wang, Yadan Luo, Zi Huang

Multi-modal reward for visual relationships-based image captioning

Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep features used in many of the…

Computer Vision and Pattern Recognition · Computer Science 2023-03-22 Ali Abedi, Hossein Karshenas, Peyman Adibi

Multi-Modal Representation Learning with Text-Driven Soft Masks

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Jaeyoo Park, Bohyung Han

Image Captioning based on Deep Reinforcement Learning

Recently it has shown that the policy-gradient methods for reinforcement learning have been utilized to train deep end-to-end systems on natural language processing tasks. What's more, with the complexity of understanding image content and…

Computer Vision and Pattern Recognition · Computer Science 2018-09-14 Haichao Shi, Peng Li, Bo Wang, Zhenyu Wang

Multimodal Data Augmentation for Image Captioning using Diffusion Models

Image captioning, an important vision-language task, often requires a tremendous number of finely labeled image-caption pairs for learning the underlying alignment between images and texts. In this paper, we proposed a multimodal data…

Computer Vision and Pattern Recognition · Computer Science 2023-11-14 Changrong Xiao, Sean Xin Xu, Kunpeng Zhang

Learning a Recurrent Visual Representation for Image Caption Generation

In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a…

Computer Vision and Pattern Recognition · Computer Science 2014-11-21 Xinlei Chen, C. Lawrence Zitnick

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

Text images contain both visual and linguistic information. However, existing pre-training techniques for text recognition mainly focus on either visual representation learning or linguistic knowledge learning. In this paper, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2023-10-11 Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Learning Visual Representations via Language-Guided Sampling

Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an…

Computer Vision and Pattern Recognition · Computer Science 2023-03-30 Mohamed El Banani, Karan Desai, Justin Johnson

Actor-Critic Sequence Training for Image Captioning

Generating natural language descriptions of images is an important capability for a robot or other visual-intelligence driven AI agent that may need to communicate with human users about what it is seeing. Such image captioning methods are…

Computer Vision and Pattern Recognition · Computer Science 2017-11-29 Li Zhang, Flood Sung, Feng Liu, Tao Xiang, Shaogang Gong, Yongxin Yang, Timothy M. Hospedales

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Understanding images without explicit supervision has become an important problem in computer vision. In this paper, we address image captioning by generating language descriptions of scenes without learning from annotated pairs of images…

Computer Vision and Pattern Recognition · Computer Science 2019-08-27 Iro Laina, Christian Rupprecht, Nassir Navab

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

Recent advancements in diffusion models have showcased their impressive capacity to generate visually striking images. Nevertheless, ensuring a close match between the generated image and the given prompt remains a persistent challenge. In…

Computer Vision and Pattern Recognition · Computer Science 2023-09-11 Yupeng Zhou, Daquan Zhou, Zuo-Liang Zhu, Yaxing Wang, Qibin Hou, Jiashi Feng

Text-guided Attention Model for Image Captioning

Visual attention plays an important role to understand images and demonstrates its effectiveness in generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer…

Computer Vision and Pattern Recognition · Computer Science 2016-12-13 Jonghwan Mun, Minsu Cho, Bohyung Han

Learning Distinct and Representative Styles for Image Captioning

Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward…

Computer Vision and Pattern Recognition · Computer Science 2023-08-16 Qi Chen, Chaorui Deng, Qi Wu

Contrastive Learning for Image Captioning

Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of…

Computer Vision and Pattern Recognition · Computer Science 2017-10-10 Bo Dai, Dahua Lin