Papers related to GenClaw: Code-Driven Agentic Image Generation

GenAgent: Scaling Text-to-Image Generation via Agentic Multimodal Reasoning

We introduce GenAgent, unifying visual understanding and generation through an agentic multimodal model. Unlike unified models that face expensive training costs and understanding-generation trade-offs, GenAgent decouples these capabilities…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-29 · Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, Wenqiang Zhang

GenFlow: Interactive Modular System for Image Generation

Generative art unlocks boundless creative possibilities, yet its full potential remains untapped due to the technical expertise required for advanced architectural concepts and computational workflows. To bridge this gap, we present…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-11 · Duc-Hung Nguyen, Huu-Phuc Huynh, Minh-Triet Tran, Trung-Nghia Le

GenArtist: Multimodal LLM as an Agent for Unified Image Generation and Editing

Despite the success achieved by existing image generation and editing methods, current models still struggle with complex problems including intricate text prompts, and the absence of verification and self-correction mechanisms makes the…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-29 · Zhenyu Wang, Aoxue Li, Zhenguo Li, Xihui Liu

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration

Text-to-video generation models have shown significant progress in the recent years. However, they still struggle with generating complex dynamic scenes based on compositional text prompts, such as attribute binding for multiple objects,…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-06 · Kaiyi Huang, Yukun Huang, Xuefei Ning, Zinan Lin, Yu Wang, Xihui Liu

Agentic Video Generation: From Text to Executable Event Graphs via Tool-Constrained LLM Planning

Existing multi-agent video generation systems use LLM agents to orchestrate neural video generators, producing visually impressive but semantically unreliable outputs with no ground truth annotations. We present an agentic system that…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Nicolae Cudlenco, Mihai Masala, Marius Leordeanu

Navigating the GAN Parameter Space for Semantic Image Editing

Generative Adversarial Networks (GANs) are currently an indispensable tool for visual editing, being a standard component of image-to-image translation and image restoration pipelines. Furthermore, GANs are especially useful for…

Machine Learning · Computer Science · 2021-04-22 · Anton Cherepkov, Andrey Voynov, Artem Babenko

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-01 · Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang

Semantic Draw Engineering for Text-to-Image Creation

Text-to-image generation is conducted through Generative Adversarial Networks (GANs) or transformer models. However, the current challenge lies in accurately generating images based on textual descriptions, especially in scenarios where the…

Human-Computer Interaction · Computer Science · 2024-01-10 · Yang Li, Huaqiang Jiang, Yangkai Wu

GenCA: A Text-conditioned Generative Model for Realistic and Drivable Codec Avatars

Photo-realistic and controllable 3D avatars are crucial for various applications such as virtual and mixed reality (VR/MR), telepresence, gaming, and film production. Traditional methods for avatar creation often involve time-consuming…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-27 · Keqiang Sun, Amin Jourabloo, Riddhish Bhalodia, Moustafa Meshry, Yu Rong, Zhengyu Yang, Thu Nguyen-Phuoc, Christian Haene, Jiu Xu, Sam Johnson, Hongsheng Li, Sofien Bouaziz

MatClaw: An Autonomous Code-First LLM Agent for End-to-End Materials Exploration

Existing LLM agents for computational materials science are constrained by pipeline-bounded architectures tied to specific simulation codes and by dependence on manually written tool functions that grow with task scope. We present MatClaw,…

Materials Science · Physics · 2026-05-25 · Chenmu Zhang, Boris I. Yakobson

A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications

Large language model (LLM)-based agents that reason, plan, and act through tools, memory, and structured interaction are emerging as a promising paradigm for automating complex workflows. Recent systems such as OpenClaw and Claude Code…

Information Retrieval · Computer Science · 2026-05-27 · Yingli Zhou, Wang Shu, Yaodong Su, Wenchuan Du, Yixiang Fang, Xuemin Lin

Agentic Visualization: Extracting Agent-based Design Patterns from Visualization Systems

Autonomous agents powered by Large Language Models are transforming AI, creating an imperative for the visualization field to embrace agentic frameworks. However, our field's focus on a human in the sensemaking loop raises critical…

Human-Computer Interaction · Computer Science · 2025-09-17 · Vaishali Dhanoa, Anton Wolter, Gabriela Molina León, Hans-Jörg Schulz, Niklas Elmqvist

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-28 · Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang

CADDesigner: Conceptual CAD Model Generation with a General-Purpose Agent

Computer-Aided Design (CAD) is widely used for conceptual design and parametric 3D modeling, but typically requires a high level of expertise from designers. To lower the entry barrier and facilitate early-stage CAD modeling, we present…

Artificial Intelligence · Computer Science · 2026-05-20 · Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, Peng Du

Mind-Brush: Integrating Agentic Cognitive Search and Reasoning into Image Generation

While text-to-image generation has achieved unprecedented fidelity, the vast majority of existing models function fundamentally as static text-to-pixel decoders. Consequently, they often fail to grasp implicit user intentions. Although…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-03 · Jun He, Junyan Ye, Zilong Huang, Dongzhi Jiang, Chenjue Zhang, Leqi Zhu, Renrui Zhang, Xiang Zhang, Weijia Li

Controlling generative models with continuous factors of variations

Recent deep generative models are able to provide photo-realistic images as well as visual or textual content embeddings useful to address various tasks of computer vision and natural language processing. Their usefulness is nevertheless…

Machine Learning · Computer Science · 2020-01-29 · Antoine Plumerault, Hervé Le Borgne, Céline Hudelot

Towards AGI A Pragmatic Approach Towards Self Evolving Agent

Large Language Model (LLM) based agents are powerful yet fundamentally static after deployment, lacking the ability to autonomously expand capabilities, generate new tools, or evolve their reasoning. This work introduces a hierarchical…

Computation and Language · Computer Science · 2026-01-21 · Indrajit Kar, Sammy Zonunpuia, Zonunfeli Ralte

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-21 · Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Generative AI and the Transformation of Software Development Practices

Generative AI is reshaping how software is designed, written, and maintained. Advances in large language models (LLMs) are enabling new development styles - from chat-oriented programming and 'vibe coding' to agentic programming - that can…

Software Engineering · Computer Science · 2025-10-14 · Vivek Acharya

Cluster-guided Image Synthesis with Unconditional Models

Generative Adversarial Networks (GANs) are the driving force behind the state-of-the-art in image generation. Despite their ability to synthesize high-resolution photo-realistic images, generating content with on-demand conditioning of…

Computer Vision and Pattern Recognition · Computer Science · 2021-12-28 · Markos Georgopoulos, James Oldfield, Grigorios G Chrysos, Yannis Panagakis