Related papers: RelationAdapter: Learning and Transferring Visual …

Edit Transfer: Learning Image Editing via Vision In-Context Relations

We introduce a new setting, Edit Transfer, where a model learns a transformation from just a single source-target example and applies it to a new query image. While text-based methods excel at semantic manipulations through textual prompts,…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-02 · Lan Chen, Qi Mao, Yuchao Gu, Mike Zheng Shou

ReVersion: Diffusion-Based Relation Inversion from Images

Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images, and existing inversion methods mainly…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Ziqi Huang, Tianxing Wu, Yuming Jiang, Kelvin C. K. Chan, Ziwei Liu

Image-to-Image Translation with Diffusion Transformers and CLIP-Based Image Conditioning

Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-06 · Qiang Zhu, Kuan Lu, Menghao Huo, Yuxiao Li

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

Recent years have witnessed the strong power of large text-to-image diffusion models for the impressive generative capability to create high-fidelity images. However, it is very tricky to generate desired images using only text prompt as it…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-15 · Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, Wei Yang

Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers

Recent advances in diffusion transformers have shown remarkable generalization in visual synthesis, yet most dense perception methods still rely on text-to-image (T2I) generators designed for stochastic generation. We revisit this paradigm…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-25 · Yiqing Shi, Yiren Song, Mike Zheng Shou

Textualize Visual Prompt for Image Editing via Diffusion Bridge

Visual prompt, a pair of before-and-after edited images, can convey indescribable imagery transformations and prosper in image editing. However, current visual prompt methods rely on a pretrained text-guided image-to-image generative model…

Computer Vision and Pattern Recognition · Computer Science · 2025-01-28 · Pengcheng Xu, Qingnan Fan, Fei Kou, Shuai Qin, Hong Gu, Ruoyu Zhao, Charles Ling, Boyu Wang

EditTransfer++: Toward Faithful and Efficient Visual-Prompt-Guided Image Editing

Visual-prompt-guided edit transfer aims to learn image transformations directly from example pairs, offering more precise and controllable editing than purely text-driven approaches. However, existing diffusion transformer-based methods…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-11 · Lan Chen, Qi Mao, Yiren Song, Yuchao Gu, Siwei Ma

MotionAdapter: Video Motion Transfer via Content-Aware Attention Customization

Recent advances in diffusion-based text-to-video models, particularly those built on the diffusion transformer architecture, have achieved remarkable progress in generating high-quality and temporally coherent videos. However, transferring…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-08 · Zhexin Zhang, Yangyang Xu, Yifeng Zhu, Long Chen, Yong Du, Shengfeng He, Jun Yu

Exploring Multimodal Diffusion Transformers for Enhanced Prompt-based Image Editing

Transformer-based diffusion models have recently superseded traditional U-Net architectures, with multimodal diffusion transformers (MM-DiT) emerging as the dominant approach in state-of-the-art models like Stable Diffusion 3 and Flux.1.…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Joonghyuk Shin, Alchan Hwang, Yujin Kim, Daneul Kim, Jaesik Park

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

A significant research effort is focused on exploiting the amazing capacities of pretrained diffusion models for the editing of images. They either finetune the model, or invert the image in the latent space of the pretrained model. However,…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-09 · Senmao Li, Joost van de Weijer, Taihang Hu, Fahad Shahbaz Khan, Qibin Hou, Yaxing Wang, Jian Yang, Ming-Ming Cheng

Att-Adapter: A Robust and Precise Domain-Specific Multi-Attributes T2I Diffusion Adapter via Conditional Variational Autoencoder

Text-to-Image (T2I) Diffusion Models have achieved remarkable performance in generating high quality images. However, enabling precise control of continuous attributes, especially multiple attributes simultaneously, in a new domain (e.g.,…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-25 · Wonwoong Cho, Yan-Ying Chen, Matthew Klenk, David I. Inouye, Yanxia Zhang

DiT4Edit: Diffusion Transformer for Image Editing

Despite recent advances in UNet-based image editing, methods for shape-aware object editing in high-resolution images are still lacking. Compared to UNet, Diffusion Transformers (DiT) demonstrate superior capabilities to effectively capture…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-08 · Kunyu Feng, Yue Ma, Bingyuan Wang, Chenyang Qi, Haozhe Chen, Qifeng Chen, Zeyu Wang

A training-free framework for high-fidelity appearance transfer via diffusion transformers

Diffusion Transformers (DiTs) excel at generation, but their global self-attention makes controllable, reference-image-based editing a distinct challenge. Unlike U-Nets, naively injecting local appearance into a DiT can disrupt its holistic…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-31 · Shengrong Gu, Ye Wang, Song Wu, Rui Ma, Qian Wang, Lanjun Wang, Zili Yi

A Diffusion Model Translator for Efficient Image-to-Image Translation

Applying diffusion models to image-to-image translation (I2I) has recently received increasing attention due to its practical applications. Previous attempts inject information from the source image into each denoising step for an iterative…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-04 · Mengfei Xia, Yu Zhou, Ran Yi, Yong-Jin Liu, Wenping Wang

ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models

Recent advancement in text-to-image models (e.g., Stable Diffusion) and corresponding personalized technologies (e.g., DreamBooth and LoRA) enables individuals to generate high-quality and imaginative images. However, they often suffer from…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-11 · Jiaxiang Cheng, Pan Xie, Xin Xia, Jiashi Li, Jie Wu, Yuxi Ren, Huixia Li, Xuefeng Xiao, Min Zheng, Lean Fu

Harnessing Diffusion Models for Visual Perception with Meta Prompts

The issue of generative pretraining for vision models has persisted as a long-standing conundrum. At present, the text-to-image (T2I) diffusion model demonstrates remarkable proficiency in generating high-definition images matching textual…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-25 · Qiang Wan, Zilong Huang, Bingyi Kang, Jiashi Feng, Li Zhang

TextLDM: Language Modeling with Continuous Latent Diffusion

Diffusion Transformers (DiT) trained with flow matching in a VAE latent space have unified visual generation across images and videos. A natural next step toward a single architecture for both generation (visual synthesis) and understanding…

Computation and Language · Computer Science · 2026-05-11 · Jiaxiu Jiang, Jingjing Ren, Wenbo Li, Bo Wang, Haoze Sun, Yijun Yang, Jianhui Liu, Yanbing Zhang, Shenghe Zheng, Yuan Zhang, Haoyang Huang, Nan Duan, Wangmeng Zuo

Towards Unified Semantic and Controllable Image Fusion: A Diffusion Transformer Approach

Image fusion aims to blend complementary information from multiple sensing modalities, yet existing approaches remain limited in robustness, adaptability, and controllability. Most current fusion networks are tailored to specific tasks and…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-09 · Jiayang Li, Chengjie Jiang, Junjun Jiang, Pengwei Liang, Jiayi Ma, Liqiang Nie

Dynamic Prompt Learning: Addressing Cross-Attention Leakage for Text-Based Image Editing

Large-scale text-to-image generative models have been a ground-breaking development in generative AI, with diffusion models showing their astounding ability to synthesize convincing images following an input text prompt. The goal of image…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-28 · Kai Wang, Fei Yang, Shiqi Yang, Muhammad Atif Butt, Joost van de Weijer

X2I: Seamless Integration of Multimodal Understanding into Diffusion Transformer via Attention Distillation

Text-to-image (T2I) models are well known for their ability to produce highly realistic images, while multimodal large language models (MLLMs) are renowned for their proficiency in understanding and integrating multiple modalities. However,…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Jian Ma, Qirong Peng, Xu Guo, Chen Chen, Haonan Lu, Zhenyu Yang