Related papers: A new approach for encoding code and assisting cod…

Generative Image Coding with Diffusion Prior

As generative technologies advance, visual content has evolved into a complex mix of natural and AI-generated images, driving the need for more efficient coding techniques that prioritize perceptual quality. Traditional codecs and learned…

Computer Vision and Pattern Recognition · Computer Science 2025-09-18 Jianhui Chang

Controllable Text-to-Image Generation with GPT-4

Current text-to-image generation models often struggle to follow textual instructions, especially the ones requiring spatial reasoning. On the other hand, Large Language Models (LLMs), such as GPT-4, have shown remarkable precision in…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang

Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-16 Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng

Prompting Large Vision-Language Models for Compositional Reasoning

Vision-language models such as CLIP have shown impressive capabilities in encoding texts and images into aligned embeddings, enabling the retrieval of multimodal data in a shared embedding space. However, these embedding-based models still…

Computer Vision and Pattern Recognition · Computer Science 2024-01-23 Timothy Ossowski, Ming Jiang, Junjie Hu

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images…

Computer Vision and Pattern Recognition · Computer Science 2026-04-08 Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

Text-to-Image Diffusion Models are Zero-Shot Classifiers

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been…

Computer Vision and Pattern Recognition · Computer Science 2023-09-07 Kevin Clark, Priyank Jaini

Improving Next Tokens via Second-to-Last Predictions with Generate and Refine

Autoregressive language models like GPT aim to predict next tokens, while autoencoding models such as BERT are trained on tasks such as predicting masked tokens. We train a decoder-only architecture for predicting the second to last token…

Computation and Language · Computer Science 2025-02-17 Johannes Schneider

Beyond Autoregression: An Empirical Study of Diffusion Large Language Models for Code Generation

LLMs have become the mainstream approaches to code generation. Existing LLMs mainly employ autoregressive generation, i.e. generating code token-by-token from left to right. However, the underlying autoregressive generation has two…

Software Engineering · Computer Science 2025-11-04 Chengze Li, Yitong Zhang, Jia Li, Liyi Cai, Ge Li

Reusing Computation in Text-to-Image Diffusion for Efficient Generation of Image Sets

Text-to-image diffusion models enable high-quality image generation but are computationally expensive. While prior work optimizes per-inference efficiency, we explore an orthogonal approach: reducing redundancy across correlated prompts.…

Computer Vision and Pattern Recognition · Computer Science 2025-08-29 Dale Decatur, Thibault Groueix, Wang Yifan, Rana Hanocka, Vladimir Kim, Matheus Gadelha

Text-To-Concept (and Back) via Cross-Model Alignment

We observe that the mapping between an image's representation in one model to its representation in another can be learned surprisingly well with just a linear layer, even across diverse models. Building on this observation, we propose…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Mazda Moayeri, Keivan Rezaei, Maziar Sanjabi, Soheil Feizi

NeuroPrompts: An Adaptive Framework to Optimize Prompts for Text-to-Image Generation

Despite impressive recent advances in text-to-image diffusion models, obtaining high-quality images often requires prompt engineering by humans who have developed expertise in using them. In this work, we present NeuroPrompts, an adaptive…

Artificial Intelligence · Computer Science 2024-04-09 Shachar Rosenman, Vasudev Lal, Phillip Howard

Powerful and Flexible: Personalized Text-to-Image Generation via Reinforcement Learning

Personalized text-to-image models allow users to generate varied styles of images (specified with a sentence) for an object (specified with a set of reference images). While remarkable results have been achieved using diffusion-based…

Computer Vision and Pattern Recognition · Computer Science 2024-07-19 Fanyue Wei, Wei Zeng, Zhenyang Li, Dawei Yin, Lixin Duan, Wen Li

Cosmos: Compressed and Smooth Latent Space for Text Diffusion Modeling

Autoregressive language models dominate modern text generation, yet their sequential nature introduces fundamental limitations: decoding is slow, and maintaining global coherence remains challenging. Diffusion models offer a promising…

Computation and Language · Computer Science 2026-01-06 Viacheslav Meshchaninov, Egor Chimbulatov, Alexander Shabalin, Aleksandr Abramov, Dmitry Vetrov

Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models

The quality of the prompts provided to text-to-image diffusion models determines how faithful the generated content is to the user's intent, often requiring 'prompt engineering'. To harness visual concepts from target images without prompt…

Computer Vision and Pattern Recognition · Computer Science 2023-12-20 Shweta Mahajan, Tanzila Rahman, Kwang Moo Yi, Leonid Sigal

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP have revolutionized visual representation learning by providing good performance on downstream datasets. VLMs are 0-shot adapted to a downstream dataset by designing…

Computer Vision and Pattern Recognition · Computer Science 2023-08-09 Mayug Maniparambil, Chris Vorster, Derek Molloy, Noel Murphy, Kevin McGuinness, Noel E. O'Connor

EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation

Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models. Current methods in this space require specific architectural changes to image…

Computer Vision and Pattern Recognition · Computer Science 2025-04-10 Diljeet Jagpal, Xi Chen, Vinay P. Namboodiri

Video-GPT via Next Clip Diffusion

GPT has shown its remarkable success in natural language processing. However, the language sequence is not sufficient to describe spatial-temporal details in the visual world. Alternatively, the video sequence is good at capturing such…

Computer Vision and Pattern Recognition · Computer Science 2025-05-22 Shaobin Zhuang, Zhipeng Huang, Ying Zhang, Fangyikang Wang, Canmiao Fu, Binxin Yang, Chong Sun, Chen Li, Yali Wang

Dual Diffusion for Unified Image Generation and Understanding

Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind with visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end…

Computer Vision and Pattern Recognition · Computer Science 2025-04-03 Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang

Lafite2: Few-shot Text-to-Image Generation

Text-to-image generation models have progressed considerably in recent years and can now generate impressively realistic images from arbitrary text. Most such models are trained on web-scale image-text paired datasets, which may not be…

Computer Vision and Pattern Recognition · Computer Science 2022-10-26 Yufan Zhou, Chunyuan Li, Changyou Chen, Jianfeng Gao, Jinhui Xu

Rejuvenating image-GPT as Strong Visual Representation Learners

This paper enhances image-GPT (iGPT), one of the pioneering works that introduce autoregressive pretraining to predict the next pixels for visual representation learning. Two simple yet essential changes are made. First, we shift the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Sucheng Ren, Zeyu Wang, Hongru Zhu, Junfei Xiao, Alan Yuille, Cihang Xie