Related papers: LayoutAgent: A Vision-Language Agent Guided Compos…

DiffusionAgent: Navigating Expert Models for Agentic Image Generation

In the accelerating era of human-instructed visual content creation, diffusion models have demonstrated remarkable generative potential. Yet their deployment is constrained by a dual bottleneck: semantic ambiguity in diverse prompts and the…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Jie Qin, Jie Wu, Weifeng Chen, Yueming Lyu

coDrawAgents: A Multi-Agent Dialogue Framework for Compositional Image Generation

Text-to-image generation has advanced rapidly, but existing models still struggle with faithfully composing multiple objects and preserving their attributes in complex scenes. We propose coDrawAgents, an interactive multi-agent dialogue…

Computer Vision and Pattern Recognition · Computer Science 2026-03-16 Chunhan Li, Qifeng Wu, Jia-Hui Pan, Ka-Hei Hui, Jingyu Hu, Yuming Jiang, Bin Sheng, Xihui Liu, Wenjuan Gong, Zhengzhe Liu

LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin

PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Models (LMMs)…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang

LayoutDiT: Exploring Content-Graphic Balance in Layout Generation with Diffusion Transformer

Layout generation is a foundation task of graphic design, which requires the integration of visual aesthetics and harmonious expression of content delivery. However, existing methods still face challenges in generating precise and visually…

Computer Vision and Pattern Recognition · Computer Science 2024-11-26 Yu Li, Yifan Chen, Gongye Liu, Fei Yin, Qingyan Bai, Jie Wu, Hongfa Wang, Ruihang Chu, Yujiu Yang

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure the controllability of text prompts over images in the context of complex text prompts, especially when it…

Computer Vision and Pattern Recognition · Computer Science 2024-01-31 Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang

Consistent Image Layout Editing with Diffusion Models

Despite the great success of large-scale text-to-image diffusion models in image generation and image editing, existing methods still struggle to edit the layout of real images. Although a few works have been proposed to tackle this…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Tao Xia, Yudi Zhang, Ting Liu, Lei Zhang

MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration

Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and…

Artificial Intelligence · Computer Science 2025-10-15 Md Hasebul Hasan, Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, Md Rizwan Parvez

LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation

Layout-to-image generation refers to the task of synthesizing photo-realistic images based on semantic layouts. In this paper, we propose LayoutDiffuse that adapts a foundational diffusion model pretrained on large-scale image or text-image…

Computer Vision and Pattern Recognition · Computer Science 2023-02-20 Jiaxin Cheng, Xiao Liang, Xingjian Shi, Tong He, Tianjun Xiao, Mu Li

Exploring Compositional Visual Generation with Latent Classifier Guidance

Diffusion probabilistic models have achieved enormous success in the field of image generation and manipulation. In this paper, we explore a novel paradigm of using the diffusion model and classifier guidance in the latent semantic space…

Computer Vision and Pattern Recognition · Computer Science 2023-05-25 Changhao Shi, Haomiao Ni, Kai Li, Shaobo Han, Mingfu Liang, Martin Renqiang Min

Layout Agnostic Scene Text Image Synthesis with Diffusion Models

While diffusion models have significantly advanced the quality of image generation, their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene…

Computer Vision and Pattern Recognition · Computer Science 2024-09-17 Qilong Zhangli, Jindong Jiang, Di Liu, Licheng Yu, Xiaoliang Dai, Ankit Ramchandani, Guan Pang, Dimitris N. Metaxas, Praveen Krishnan

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-13 Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

In this paper, we propose a unified layout planning and image generation model, PlanGen, which can pre-plan spatial layout conditions before generating images. Unlike previous diffusion-based models that treat layout planning and…

Computer Vision and Pattern Recognition · Computer Science 2025-04-01 Runze He, Bo Cheng, Yuhang Ma, Qingxiang Jia, Shanyuan Liu, Ao Ma, Xiaoyu Wu, Liebucha Wu, Dawei Leng, Yuhui Yin

CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation

Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and…

Computer Vision and Pattern Recognition · Computer Science 2025-12-30 Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang

RoboLayout: Differentiable 3D Scene Generation for Embodied Agents

Recent advances in vision language models (VLMs) have shown strong potential for spatial reasoning and 3D scene layout generation from open-ended language instructions. However, generating layouts that are not only semantically coherent but…

Artificial Intelligence · Computer Science 2026-03-10 Ali Shamsaddinlou

SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches,…

Artificial Intelligence · Computer Science 2026-03-03 Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu

Unifying Layout Generation with a Decoupled Diffusion Model

Layout generation aims to synthesize realistic graphic scenes consisting of elements with different attributes including category, size, position, and between-element relation. It is a crucial task for reducing the burden on heavy-duty…

Computer Vision and Pattern Recognition · Computer Science 2023-03-10 Mude Hui, Zhizheng Zhang, Xiaoyi Zhang, Wenxuan Xie, Yuwang Wang, Yan Lu

DreamLayer: Simultaneous Multi-Layer Generation via Diffusion Mode

Text-driven image generation using diffusion models has recently gained significant attention. To enable more flexible image manipulation and editing, recent research has expanded from single image generation to transparent layer generation…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Junjia Huang, Pengxiang Yan, Jinhang Cai, Jiyang Liu, Zhao Wang, Yitong Wang, Xinglong Wu, Guanbin Li

DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving

We introduce DriveAgent, a novel multi-agent autonomous driving framework that leverages large language model (LLM) reasoning combined with multimodal sensor fusion to enhance situational understanding and decision-making. DriveAgent…

Robotics · Computer Science 2025-05-06 Xinmeng Hou, Wuqi Wang, Long Yang, Hao Lin, Jinglun Feng, Haigen Min, Xiangmo Zhao