MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Zihao Lin; Wanrong Zhu; Jiuxiang Gu; Jihyung Kil; Christopher Tensmeyer; Lin Zhang; Shilong Liu; Ruiyi Zhang; Lifu Huang; Vlad I. Morariu; Tong Sun

Computer Vision and Pattern Recognition 2026-01-30 v2

Abstract

Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify the relevant layers and coordinate modifications across them. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, both of which assume a flat canvas and lack the reasoning needed to determine what to modify and where. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions: instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.

Keywords

image editing; multi-agent; reasoning; generative design

Cite

@article{arxiv.2601.04589,
  title  = {MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing},
  author = {Zihao Lin and Wanrong Zhu and Jiuxiang Gu and Jihyung Kil and Christopher Tensmeyer and Lin Zhang and Shilong Liu and Ruiyi Zhang and Lifu Huang and Vlad I. Morariu and Tong Sun},
  journal= {arXiv preprint arXiv:2601.04589},
  year   = {2026}
}