Related papers: coDrawAgents: A Multi-Agent Dialogue Framework for…

Divide and Conquer: Language Models can Plan and Self-Correct for Compositional Text-to-Image Generation

Despite significant advancements in text-to-image models for generating high-quality images, these methods still struggle to ensure the controllability of text prompts over images in the context of complex text prompts, especially when it…

Computer Vision and Pattern Recognition · Computer Science · 2024-01-31 · Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, Zhenguo Li

CoAgent: Collaborative Planning and Consistency Agent for Coherent Video Generation

Maintaining narrative coherence and visual consistency remains a central challenge in open-domain video generation. Existing text-to-video models often treat each shot independently, resulting in identity drift, scene inconsistency, and…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-30 · Qinglin Zeng, Kaitong Cai, Ruiqi Chen, Qinhan Lv, Keze Wang

LayoutAgent: A Vision-Language Agent Guided Compositional Diffusion for Spatial Layout Planning

Designing realistic multi-object scenes requires not only generating images, but also planning spatial layouts that respect semantic relations and physical plausibility. On one hand, while recent advances in diffusion models have enabled…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-30 · Zezhong Fan, Xiaohan Li, Luyi Ma, Kai Zhao, Liang Peng, Topojoy Biswas, Evren Korpeoglu, Kaushiki Nag, Kannan Achan

DialogueAgents: A Hybrid Agent-Based Speech Synthesis Framework for Multi-Party Dialogue

Speech synthesis is crucial for human-computer interaction, enabling natural and intuitive communication. However, existing datasets involve high construction costs due to manual annotation and suffer from limited character diversity,…

Computation and Language · Computer Science · 2025-04-22 · Xiang Li, Duyi Pan, Hongru Xiao, Jiale Han, Jing Tang, Jiabao Ma, Wei Wang, Bo Cheng

CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs

Effective prompt design is essential for improving the planning capabilities of large language model (LLM)-driven agents. However, existing structured prompting strategies are typically limited to single-agent, plan-only settings, and often…

Artificial Intelligence · Computer Science · 2025-07-08 · Bruce Yang, Xinfeng He, Huan Gao, Yifan Cao, Xiaofan Li, David Hsu

Teaching Text-to-Image Models to Communicate in Dialog

A picture is worth a thousand words; thus, it is crucial for conversational agents to understand, perceive, and effectively respond with pictures. However, we find that directly employing conventional image generation techniques is…

Computation and Language · Computer Science · 2024-02-09 · Xiaowen Sun, Jiazhan Feng, Yuxuan Wang, Yuxuan Lai, Xingyu Shen, Dongyan Zhao

Talk2Image: A Multi-Agent System for Multi-Turn Image Generation and Editing

Text-to-image generation tasks have driven remarkable advances in diverse media applications, yet most focus on single-turn scenarios and struggle with iterative, multi-turn creative tasks. Recent dialogue-based systems attempt to bridge…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Shichao Ma, Yunhe Guo, Jiahao Su, Qihe Huang, Zhengyang Zhou, Yang Wang

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

Document Question Answering (DocQA) is a very common task. Existing methods using Large Language Models (LLMs) or Large Vision Language Models (LVLMs) and Retrieval Augmented Generation (RAG) often prioritize information from a single…

Machine Learning · Computer Science · 2025-03-19 · Siwei Han, Peng Xia, Ruiyi Zhang, Tong Sun, Yun Li, Hongtu Zhu, Huaxiu Yao

Collaborative Text-to-Image Generation via Multi-Agent Reinforcement Learning and Semantic Fusion

Multimodal text-to-image generation remains constrained by the difficulty of maintaining semantic alignment and professional-level detail across diverse visual domains. We propose a multi-agent reinforcement learning framework that…

Artificial Intelligence · Computer Science · 2025-10-14 · Jiabao Shi, Minfeng Qi, Lefeng Zhang, Di Wang, Yingjie Zhao, Ziying Li, Yalong Xing, Ningran Li

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-29 · Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang

PromptSculptor: Multi-Agent Based Text-to-Image Prompt Optimization

The rapid advancement of generative AI has democratized access to powerful tools such as Text-to-Image models. However, to generate high-quality images, users must still craft detailed prompts specifying scene, style, and context, often…

Multiagent Systems · Computer Science · 2025-09-25 · Dawei Xiang, Wenyan Xu, Kexin Chu, Tianqi Ding, Zixu Shen, Yiming Zeng, Jianchang Su, Wei Zhang

AutoAgent: Evolving Cognition and Elastic Memory Orchestration for Adaptive Agents

Autonomous agent frameworks still struggle to reconcile long-term experiential learning with real-time, context-sensitive decision-making. In practice, this gap appears as static cognition, rigid workflow dependence, and inefficient context…

Artificial Intelligence · Computer Science · 2026-03-11 · Xiaoxing Wang, Ning Liao, Shikun Wei, Chen Tang, Feiyu Xiong

AgentComp: From Agentic Reasoning to Compositional Mastery in Text-to-Image Models

Text-to-image generative models have achieved remarkable visual quality but still struggle with compositionality: accurately capturing object relationships, attribute bindings, and fine-grained details in prompts. A key limitation is that…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-11 · Arman Zarei, Jiacheng Pan, Matthew Gwilliam, Soheil Feizi, Zhenheng Yang

PhotoAgent: A Robotic Photographer with Spatial and Aesthetic Understanding

Embodied agents for creative tasks like photography must bridge the semantic gap between high-level language commands and geometric control. We introduce PhotoAgent, an agent that achieves this by integrating Large Multimodal Models (LMMs)…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-25 · Lirong Che, Zhenfeng Gan, Yanbo Chen, Junbo Tan, Xueqian Wang

Intelligent Co-Design: An Interactive LLM Framework for Interior Spatial Design via Multi-Modal Agents

In architectural interior design, miscommunication frequently arises as clients lack design knowledge, while designers struggle to explain complex spatial relationships, leading to delayed timelines and financial losses. Recent advancements…

Artificial Intelligence · Computer Science · 2026-03-17 · Ren Jian Lim, Rushi Dai

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

Topology optimization can generate efficient structures, but designers often must manually translate qualitative intent, such as desired visual style, product experience, or manufacturability into solver settings that are not directly tied…

Artificial Intelligence · Computer Science · 2026-05-22 · Isabella A. Stewart, Hongrui Chen, Faez Ahmed

ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-10 · Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui

AutoAgents: A Framework for Automatic Agent Generation

Large language models (LLMs) have enabled remarkable advances in automated task-solving with multi-agent systems. However, most existing LLM-based multi-agent approaches rely on predefined agents to handle simple tasks, limiting the…

Artificial Intelligence · Computer Science · 2024-05-01 · Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, Börje F. Karlsson, Jie Fu, Yemin Shi

CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation

The rapid development of generative artificial intelligence (GenAI) presents new opportunities to advance the cartographic process. Previous studies have either overlooked the artistic aspects of maps or faced challenges in creating both…

Human-Computer Interaction · Computer Science · 2025-05-16 · Chenglong Wang, Yuhao Kang, Zhaoya Gong, Pengjun Zhao, Yu Feng, Wenjia Zhang, Ge Li

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Text-to-video generation models have shown significant progress in recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects,…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-06 · Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu