Related papers: Decoder-Only LLMs are Better Controllers for Diffu…

Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-06 · Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu

A Comprehensive Study of Decoder-Only LLMs for Text-to-Image Generation

Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-17 · Andrew Z. Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, Yogesh Balaji

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating…

Computation and Language · Computer Science · 2024-07-04 · Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

LLM4GEN: Leveraging Semantic Representation of LLMs for Text-to-Image Generation

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-28 · Mushui Liu, Yuhang Ma, Yang Zhen, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan

On decoder-only architecture for speech-to-text and large language model integration

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-10-03 · Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

DEEM: Diffusion Models Serve as the Eyes of Large Language Models for Image Perception

The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation,…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-11 · Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models

Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-05 · Long Lian, Boyi Li, Adam Yala, Trevor Darrell

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts,…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-11 · Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu

High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding

Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-17 · Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-19 · Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li

Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-16 · Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie

Large Language Models are Universal Reasoners for Visual Generation

Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-06 · Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Liang-Chieh Chen, Jiasen Lu

Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders,…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-16 · Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-27 · Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-30 · Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang

SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

Diffusion models, which have emerged to become popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, there are limitations to semantic understanding and commonsense…

Computation and Language · Computer Science · 2023-11-30 · Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin

LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-26 · Pengzhi Li, Pengfei Yu, Zide Liu, Wei He, Xuhao Pan, Xudong Rao, Tao Wei, Wei Chen

DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM

In the burgeoning field of natural language processing (NLP), Neural Topic Models (NTMs), Large Language Models (LLMs), and diffusion models have emerged as areas of significant research interest. Despite this, NTMs primarily utilize…

Computation and Language · Computer Science · 2023-12-27 · Weijie Xu, Wenxiang Hu, Fanyou Wu, Srinivasan Sengamedu

Mimir: Improving Video Diffusion Models for Precise Text Understanding

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-05 · Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang

How to Train Your Latent Diffusion Language Model Jointly With the Latent Space

Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent…

Computation and Language · Computer Science · 2026-05-11 · Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, Dmitry Vetrov