Related papers: MCCD: Multi-Agent Collaboration-based Compositiona…

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Recent advances in text-to-image diffusion models have demonstrated impressive capabilities in image quality. However, complex scene generation remains relatively unexplored, and even the definition of 'complex scene' itself remains…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Minghao Liu, Le Zhang, Yingjie Tian, Xiaochao Qu, Luoqi Liu, Ting Liu

Generating Intermediate Representations for Compositional Text-To-Image Generation

Text-to-image diffusion models have demonstrated an impressive ability to produce high-quality outputs. However, they often struggle to accurately follow fine-grained spatial information in an input text. To this end, we propose a…

Computer Vision and Pattern Recognition · Computer Science 2024-10-22 Ran Galun, Sagie Benaim

Mixture of Diffusers for scene composition and high resolution image generation

Diffusion methods have proven very effective at generating images conditioned on a text prompt. However, although the quality of the generated images is unprecedented, these methods seem to struggle when trying to…

Computer Vision and Pattern Recognition · Computer Science 2023-02-07 Álvaro Barbero Jiménez

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Diffusion models have exhibited exceptional performance in text-to-image generation and editing. However, existing methods often face challenges when handling complex text prompts that involve multiple objects with multiple attributes and…

Computer Vision and Pattern Recognition · Computer Science 2024-06-05 Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, Bin Cui

Compositional 3D Scene Generation using Locally Conditioned Diffusion

Designing complex 3D scenes has been a tedious, manual process requiring domain expertise. Emerging text-to-3D generative models show great promise for making this task more intuitive, but existing approaches are limited to object-level…

Computer Vision and Pattern Recognition · Computer Science 2023-03-24 Ryan Po, Gordon Wetzstein

Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Text-to-image diffusion generative models can generate high-quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning; however, existing methods lack layout editing ability…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

We argue that diffusion models' success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that…

Computer Vision and Pattern Recognition · Computer Science 2026-01-07 Samuel Lavoie, Michael Noukhovitch, Aaron Courville

Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models

Text-to-image generation has witnessed significant advancements with the integration of Large Vision-Language Models (LVLMs), yet challenges remain in aligning complex textual descriptions with high-quality, visually coherent images. This…

Computer Vision and Pattern Recognition · Computer Science 2025-01-03 Emily Johnson, Noah Wilson

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects,…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

Multi-focal Conditioned Latent Diffusion for Person Image Synthesis

The Latent Diffusion Model (LDM) has demonstrated strong capabilities in high-resolution image generation and has been widely employed for Pose-Guided Person Image Synthesis (PGPIS), yielding promising results. However, the compression…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Jiaqi Liu, Jichao Zhang, Paolo Rota, Nicu Sebe

Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

Diffusion models have recently gained recognition for generating diverse and high-quality content, especially in image synthesis. These models excel not only in creating fixed-size images but also in producing panoramic images. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Xiaoyu Zhang, Teng Zhou, Xinlong Zhang, Jia Wei, Yongchuan Tang

Multi-Concept Customization of Text-to-Image Diffusion

While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to…

Computer Vision and Pattern Recognition · Computer Science 2023-06-21 Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, Jun-Yan Zhu

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Recent advances in text-to-image generation with diffusion models present transformative capabilities in image quality. However, user controllability of the generated image and fast adaptation to new tasks still remain an open challenge,…

Computer Vision and Pattern Recognition · Computer Science 2023-02-17 Omer Bar-Tal, Lior Yariv, Yaron Lipman, Tali Dekel

Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion

Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that…

Artificial Intelligence · Computer Science 2025-10-14 Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

Recently, diffusion-based image generation methods are credited for their remarkable text-to-image generation capabilities, while still facing challenges in accurately generating multilingual scene text images. To tackle this problem, we…

Computer Vision and Pattern Recognition · Computer Science 2023-12-20 Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

NeuralField-LDM: Scene Generation with Hierarchical Latent Diffusion Models

Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing…

Computer Vision and Pattern Recognition · Computer Science 2023-04-20 Seung Wook Kim, Bradley Brown, Kangxue Yin, Karsten Kreis, Katja Schwarz, Daiqing Li, Robin Rombach, Antonio Torralba, Sanja Fidler

Canvas-to-Image: Compositional Image Generation with Multimodal Controls

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

A Diffusion-based Method for Multi-turn Compositional Image Generation

Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image given a modification text. While most of the existing methods for M-CIG are based on generative adversarial…

Computer Vision and Pattern Recognition · Computer Science 2023-11-15 Chao Wang

Enhancing Scene Text Detectors with Realistic Text Image Synthesis Using Diffusion Models

Scene text detection techniques have garnered significant attention due to their wide-ranging applications. However, existing methods have a high demand for training data, and obtaining accurate human annotations is labor-intensive and…

Computer Vision and Pattern Recognition · Computer Science 2023-11-29 Ling Fu, Zijie Wu, Yingying Zhu, Yuliang Liu, Xiang Bai

Composite Diffusion | whole ≥ Σ parts

For an artist or a graphic designer, the spatial layout of a scene is a critical design choice. However, existing text-to-image diffusion models provide limited support for incorporating spatial information. This paper introduces Composite…

Computer Vision and Pattern Recognition · Computer Science 2023-07-27 Vikram Jamwal, Ramaneswaran S