Related papers: MiLDEdit: Reasoning-Based Multi-Layer Design Docum…

TextEditBench: Evaluating Reasoning-aware Text Editing Beyond Rendering

Text rendering has recently emerged as one of the most challenging frontiers in visual generation, drawing significant attention from large-scale diffusion and multimodal models. However, text editing within images remains largely…

Computer Vision and Pattern Recognition · Computer Science 2025-12-19 Rui Gui, Yang Wan, Haochen Han, Dongxing Mao, Fangming Liu, Min Li, Alex Jinpeng Wang

Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing

Large Multi-modality Models (LMMs) have made significant progress in visual understanding and generation, but they still face challenges in General Visual Editing, particularly in following complex instructions, preserving appearance…

Computer Vision and Pattern Recognition · Computer Science 2025-05-28 Xiangyu Zhao, Peiyuan Zhang, Kexian Tang, Xiaorong Zhu, Hao Li, Wenhao Chai, Zicheng Zhang, Renqiu Xia, Guangtao Zhai, Junchi Yan, Hua Yang, Xue Yang, Haodong Duan

ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Yiran Zhao, Yaoqi Ye, Xiang Liu, Michael Qizhe Shieh, Trung Bui

Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing

Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves…

Computer Vision and Pattern Recognition · Computer Science 2025-11-03 Yijia Wang, Yiqing Shen, Weiming Chen, Zhihai He

Towards Robust Real-World Spreadsheet Understanding with Multi-Agent Multi-Format Reasoning

Spreadsheets are central to real-world applications such as enterprise reporting, auditing, and scientific data management. Despite their ubiquity, existing large language model based approaches typically treat tables as plain text,…

Computation and Language · Computer Science 2026-04-15 Houxing Ren, Mingjie Zhan, Zimu Lu, Ke Wang, Yunqiao Yang, Haotian Hou, Hongsheng Li

An LLM-LVLM Driven Agent for Iterative and Fine-Grained Image Editing

Despite the remarkable capabilities of text-to-image (T2I) generation models, real-world applications often demand fine-grained, iterative image editing that existing methods struggle to provide. Key challenges include granular instruction…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Zihan Liang, Jiahao Sun, Haoran Ma

ReasonEdit: Towards Reasoning-Enhanced Image Editing Models

Recent advances in image editing models have shown remarkable progress. A common architectural design couples a multimodal large language model (MLLM) encoder with a diffusion decoder, as seen in systems such as Step1X-Edit and…

Computer Vision and Pattern Recognition · Computer Science 2025-12-02 Fukun Yin, Shiyu Liu, Yucheng Han, Zhibo Wang, Peng Xing, Rui Wang, Wei Cheng, Yingming Wang, Aojie Li, Zixin Yin, Pengtao Chen, Xiangyu Zhang, Daxin Jiang, Xianfang Zeng, Gang Yu

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang

InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Zhiqiang Sheng, Xumeng Han, Zhiwei Zhang, Zenghui Xiong, Yifan Ding, Aoxiang Ping, Xiang Li, Tong Guo, Yao Mao

MIND-Edit: MLLM Insight-Driven Editing via Language-Vision Projection

Recent advances in AI-generated content (AIGC) have significantly accelerated image editing techniques, driving increasing demand for diverse and fine-grained edits. Despite these advances, existing image editing methods still face…

Computer Vision and Pattern Recognition · Computer Science 2025-05-27 Shuyu Wang, Weiqi Li, Qian Wang, Shijie Zhao, Jian Zhang

ReasonEdit: Editing Vision-Language Models using Human Reasoning

Model editing aims to correct errors in large, pretrained models without altering unrelated behaviors. While some recent works have edited vision-language models (VLMs), no existing editors tackle reasoning-heavy tasks, which typically…

Computer Vision and Pattern Recognition · Computer Science 2026-05-13 Jiaxing Qiu, Kaihua Hou, Roxana Daneshjou, Ahmed Alaa, Thomas Hartvigsen

RECODE: Reasoning Through Code Generation for Visual Question Answering

Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering --…

Computer Vision and Pattern Recognition · Computer Science 2026-03-11 Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi

MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance

Natural language processing evaluation has made significant progress, largely driven by the proliferation of powerful large language models (LLMs). New evaluation benchmarks are of increasing priority as the reasoning capabilities of LLMs…

Computation and Language · Computer Science 2025-06-19 Joseph J. Peper, Wenzhao Qiu, Ali Payani, Lu Wang

PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images

Structured images (e.g., charts and geometric diagrams) remain challenging for multimodal large language models (MLLMs), as perceptual slips can cascade into erroneous conclusions. Intermediate visual cues can steer reasoning; however,…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Shuoshuo Zhang, Zijian Li, Yizhen Zhang, Jingjing Fu, Lei Song, Jiang Bian, Jun Zhang, Yujiu Yang, Rui Wang

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research…

Computer Vision and Pattern Recognition · Computer Science 2026-05-22 Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang

SMART-Editor: A Multi-Agent Framework for Human-Like Design Editing with Structural Integrity

We present SMART-Editor, a framework for compositional layout and content editing across structured (posters, websites) and unstructured (natural images) domains. Unlike prior models that perform local edits, SMART-Editor preserves global…

Computation and Language · Computer Science 2025-08-06 Ishani Mondal, Meera Bharadwaj, Ayush Roy, Aparna Garimella, Jordan Lee Boyd-Graber

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

Current instruction-based editing methods, such as InstructPix2Pix, often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper…

Computer Vision and Pattern Recognition · Computer Science 2023-12-13 Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

PosterCopilot: Toward Layout Reasoning and Controllable Editing for Professional Graphic Design

Graphic design forms the cornerstone of modern visual communication, serving as a vital medium for promoting cultural and commercial events. Recent advances have explored automating this process using Large Multimodal Models (LMMs), yet…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Jiazhe Wei, Ken Li, Tianyu Lao, Haofan Wang, Liang Wang, Caifeng Shan, Chenyang Si

ScaleEdit-12M: Scaling Open-Source Image Editing Data Generation via Multi-Agent Framework

Instruction-based image editing has emerged as a key capability for unified multimodal models (UMMs), yet constructing large-scale, diverse, and high-quality editing datasets without costly proprietary APIs remains challenging. Previous…

Computer Vision and Pattern Recognition · Computer Science 2026-03-25 Guanzhou Chen, Erfei Cui, Changyao Tian, Danni Yang, Ganlin Yang, Yu Qiao, Hongsheng Li, Gen Luo, Hongjie Zhang