Related papers: CLIP-Diffusion-LM: Apply Diffusion Model on Image …

Exploring Discrete Diffusion Models for Image Captioning

The image captioning task is typically realized by an auto-regressive method that decodes the text tokens one by one. We present a diffusion-based captioning model, dubbed the name DDCap, to allow more decoding flexibility. Unlike image…

Computer Vision and Pattern Recognition · Computer Science 2022-12-12 Zixin Zhu , Yixuan Wei , Jianfeng Wang , Zhe Gan , Zheng Zhang , Le Wang , Gang Hua , Lijuan Wang , Zicheng Liu , Han Hu

CLIP-VQDiffusion : Langauge Free Training of Text To Image generation using CLIP and vector quantized diffusion model

There has been a significant progress in text conditional image generation models. Recent advancements in this field depend not only on improvements in model structures, but also vast quantities of text-image paired datasets. However,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-25 Seungdae Han , Joohee Kim

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

While impressive performance has been achieved in image captioning, the limited diversity of the generated captions and the large parameter scale remain major barriers to the real-word application of these systems. In this work, we propose…

Computer Vision and Pattern Recognition · Computer Science 2023-10-18 Guisheng Liu , Yi Li , Zhengcong Fei , Haiyan Fu , Xiangyang Luo , Yanqing Guo

DiffCap: Exploring Continuous Diffusion on Image Captioning

Current image captioning works usually focus on generating descriptions in an autoregressive manner. However, there are limited works that focus on generating descriptions non-autoregressively, which brings more decoding diversity. Inspired…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Yufeng He , Zefan Cai , Xu Gan , Baobao Chang

Hierarchical Text-Conditional Image Generation with CLIP Latents

Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a…

Computer Vision and Pattern Recognition · Computer Science 2022-04-14 Aditya Ramesh , Prafulla Dhariwal , Alex Nichol , Casey Chu , Mark Chen

Dense Text-to-Image Generation with Attention Modulation

Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a…

Computer Vision and Pattern Recognition · Computer Science 2023-08-25 Yunji Kim , Jiyoung Lee , Jin-Hwa Kim , Jung-Woo Ha , Jun-Yan Zhu

Guiding Image Captioning Models Toward More Specific Captions

Image captioning is conventionally formulated as the task of generating captions for images that match the distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Simon Kornblith , Lala Li , Zirui Wang , Thao Nguyen

Controlling Latent Diffusion Using Latent CLIP

Instead of performing text-conditioned denoising in the image domain, latent diffusion models (LDMs) operate in latent space of a variational autoencoder (VAE), enabling more efficient processing at reduced computational costs. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Jason Becker , Chris Wendler , Peter Baylies , Robert West , Christian Wressnegger

Diffusion Model-Based Image Editing: A Survey

Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning…

Computer Vision and Pattern Recognition · Computer Science 2025-03-12 Yi Huang , Jiancheng Huang , Yifan Liu , Mingfu Yan , Jiaxi Lv , Jianzhuang Liu , Wei Xiong , He Zhang , Liangliang Cao , Shifeng Chen

Phrase-based Image Captioning

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a…

Computation and Language · Computer Science 2015-04-10 Rémi Lebret , Pedro O. Pinheiro , Ronan Collobert

Text-to-Image Diffusion Models are Zero-Shot Classifiers

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been…

Computer Vision and Pattern Recognition · Computer Science 2023-09-07 Kevin Clark , Priyank Jaini

Diffusion Probabilistic Modeling for Video Generation

Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in…

Computer Vision and Pattern Recognition · Computer Science 2022-12-09 Ruihan Yang , Prakhar Srivastava , Stephan Mandt

A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation

Text-to-image diffusion models achieved a remarkable leap in capabilities over the last few years, enabling high-quality and diverse synthesis of images from a textual prompt. However, even the most advanced models often struggle to…

Computer Vision and Pattern Recognition · Computer Science 2023-10-26 Eyal Segalis , Dani Valevski , Danny Lumen , Yossi Matias , Yaniv Leviathan

DiffVC: A Non-autoregressive Framework Based on Diffusion Model for Video Captioning

Current video captioning methods usually use an encoder-decoder structure to generate text autoregressively. However, autoregressive methods have inherent limitations such as slow generation speed and large cumulative error. Furthermore,…

Computer Vision and Pattern Recognition · Computer Science 2026-04-10 Junbo Wang , Liangyu Fu , Yuke Li , Yining Zhu , Ya Jing , Xuecheng Wu , Jiangbin Zheng

An Ensemble Model with Attention Based Mechanism for Image Captioning

Image captioning creates informative text from an input image by creating a relationship between the words and the actual content of an image. Recently, deep learning models that utilize transformers have been the most successful in…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Israa Al Badarneh , Bassam Hammo , Omar Al-Kadi

Variational Distribution Learning for Unsupervised Text-to-Image Generation

We propose a text-to-image generation algorithm based on deep neural networks when text captions for images are unavailable during training. In this work, instead of simply generating pseudo-ground-truth sentences of training images using…

Computer Vision and Pattern Recognition · Computer Science 2023-03-29 Minsoo Kang , Doyup Lee , Jiseob Kim , Saehoon Kim , Bohyung Han

From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models

Diffusion models have revolted the field of text-to-image generation recently. The unique way of fusing text and image information contributes to their remarkable capability of generating highly text-related images. From another…

Computer Vision and Pattern Recognition · Computer Science 2024-10-02 Changming Xiao , Qi Yang , Feng Zhou , Changshui Zhang

Lossy Image Compression with Foundation Diffusion Models

Incorporating diffusion models in the image compression domain has the potential to produce realistic and detailed reconstructions, especially at extremely low bitrates. Previous methods focus on using diffusion models as expressive…

Image and Video Processing · Electrical Eng. & Systems 2024-10-10 Lucas Relic , Roberto Azevedo , Markus Gross , Christopher Schroers

Debiasing Classifiers by Amplifying Bias with Latent Diffusion and Large Language Models

Neural networks struggle with image classification when biases are learned and misleads correlations, affecting their generalization and performance. Previous methods require attribute labels (e.g. background, color) or utilizes Generative…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Donggeun Ko , Dongjun Lee , Namjun Park , Wonkyeong Shim , Jaekwang Kim

Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images

Remote sensing image change captioning (RSICC) aims at generating human-like language to describe the semantic changes between bi-temporal remote sensing image pairs. It provides valuable insights into environmental dynamics and land…

Computer Vision and Pattern Recognition · Computer Science 2024-05-22 Xiaofei Yu , Yitong Li , Jie Ma