Related papers: Decoder-Only LLMs are Better Controllers for Diffu…

200 papers

Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Bingqi Ma, Zhuofan Zong, Guanglu Song, Hongsheng Li, Yu Liu

Both text-to-image generation and large language models (LLMs) have made significant advancements. However, many text-to-image models still employ the somewhat outdated T5 and CLIP as their text encoders. In this work, we investigate the…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Andrew Z. Wang, Songwei Ge, Tero Karras, Ming-Yu Liu, Yogesh Balaji

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating…

Computation and Language · Computer Science 2024-07-04 Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

Diffusion models have exhibited substantial success in text-to-image generation. However, they often encounter challenges when dealing with complex and dense prompts involving multiple objects, attribute binding, and long descriptions. In…

Computer Vision and Pattern Recognition · Computer Science 2024-08-28 Mushui Liu, Yuhang Ma, Yang Zhen, Jun Dan, Yunlong Yu, Zeng Zhao, Zhipeng Hu, Bai Liu, Changjie Fan

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-03 Jian Wu, Yashesh Gaur, Zhuo Chen, Long Zhou, Yimeng Zhu, Tianrui Wang, Jinyu Li, Shujie Liu, Bo Ren, Linquan Liu, Yu Wu

The development of large language models (LLMs) has significantly advanced the emergence of large multimodal models (LMMs). While LMMs have achieved tremendous success by promoting the synergy between multimodal comprehension and creation,…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Run Luo, Yunshui Li, Longze Chen, Wanwei He, Ting-En Lin, Ziqiang Liu, Lei Zhang, Zikai Song, Xiaobo Xia, Tongliang Liu, Min Yang, Binyuan Hui

Recent advancements in text-to-image diffusion models have yielded impressive results in generating realistic and diverse images. However, these models still struggle with complex prompts, such as those that involve numeracy and spatial…

Computer Vision and Pattern Recognition · Computer Science 2024-03-05 Long Lian, Boyi Li, Adam Yala, Trevor Darrell

Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts,…

Computer Vision and Pattern Recognition · Computer Science 2024-03-11 Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, Gang Yu

Recent large-scale vision-language models (VLMs) have shown remarkable text-to-image generation capabilities, yet their visual fidelity remains constrained by the discrete image tokenization, which poses a major challenge. Although several…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, Chong Luo, Chang D. Yoo

One critical prerequisite for faithful text-to-image generation is the accurate understanding of text inputs. Existing methods leverage the text encoder of the CLIP model to represent input prompts. However, the pre-trained CLIP model can…

Computer Vision and Pattern Recognition · Computer Science 2024-07-19 Zhiyu Tan, Mengping Yang, Luozheng Qin, Hao Yang, Ye Qian, Qiang Zhou, Cheng Zhang, Hao Li

This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis -- specifically, the deep fusion of large language…

Computer Vision and Pattern Recognition · Computer Science 2025-05-16 Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, Saining Xie

Text-to-image generation has advanced rapidly with diffusion models, progressing from CLIP and T5 conditioning to unified systems where a single LLM backbone handles both visual understanding and generation. Despite the architectural…

Computer Vision and Pattern Recognition · Computer Science 2026-05-06 Sucheng Ren, Chen Chen, Zhenbang Wang, Liangchen Song, Xiangxin Zhu, Alan Yuille, Liang-Chieh Chen, Jiasen Lu

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-16 Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in…

Computer Vision and Pattern Recognition · Computer Science 2024-02-27 Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts. Despite the advancement, these diffusion models sometimes struggle to…

Computer Vision and Pattern Recognition · Computer Science 2023-11-30 Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang

Diffusion models, which have emerged to become popular text-to-image generation models, can produce high-quality and content-rich images guided by textual prompts. However, there are limitations to semantic understanding and commonsense…

Computation and Language · Computer Science 2023-11-30 Shanshan Zhong, Zhongzhan Huang, Wushao Wen, Jinghui Qin, Liang Lin

In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit…

Computer Vision and Pattern Recognition · Computer Science 2025-02-26 Pengzhi Li, Pengfei Yu, Zide Liu, Wei He, Xuhao Pan, Xudong Rao, Tao Wei, Wei Chen

In the burgeoning field of natural language processing (NLP), Neural Topic Models (NTMs), Large Language Models (LLMs), and diffusion models have emerged as areas of significant research interest. Despite this, NTMs primarily utilize…

Computation and Language · Computer Science 2023-12-27 Weijie Xu, Wenxiang Hu, Fanyou Wu, Srinivasan Sengamedu

Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text…

Computer Vision and Pattern Recognition · Computer Science 2024-12-05 Shuai Tan, Biao Gong, Yutong Feng, Kecheng Zheng, Dandan Zheng, Shuwei Shi, Yujun Shen, Jingdong Chen, Ming Yang

Latent diffusion models offer an attractive alternative to discrete diffusion for non-autoregressive text generation by operating on continuous text representations and denoising entire sequences in parallel. The major challenge in latent…

Computation and Language · Computer Science 2026-05-11 Viacheslav Meshchaninov, Alexander Shabalin, Egor Chimbulatov, Nikita Gushchin, Ilya Koziev, Alexander Korotin, Dmitry Vetrov