Related papers: EasyControl: Adding Efficient and Flexible Control…

EasyControl: Transfer ControlNet to Video Diffusion for Controllable Generation and Interpolation

Following the advancements in text-guided image generation technology exemplified by Stable Diffusion, video generation is gaining increased attention in the academic community. However, relying solely on text guidance for video generation…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-17 · Cong Wang, Jiaxi Gu, Panwen Hu, Haoyu Zhao, Yuanfan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer

With the rapid development of diffusion models in image generation, the demand for more powerful and flexible controllable frameworks is increasing. Although existing methods can guide generation beyond text prompts, the challenge of…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-09 · Haoxuan Wang, Jinlong Peng, Qingdong He, Hao Yang, Ying Jin, Jiafu Wu, Xiaobin Hu, Yanjie Pan, Zhenye Gan, Mingmin Chi, Bo Peng, Yabiao Wang

OminiControl: Minimal and Universal Control for Diffusion Transformer

We present OminiControl, a novel approach that rethinks how image conditions are integrated into Diffusion Transformer (DiT) architectures. Current image conditioning methods either introduce substantial parameter overhead or handle only…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-08 · Zhenxiong Tan, Songhua Liu, Xingyi Yang, Qiaochu Xue, Xinchao Wang

NanoControl: A Lightweight Framework for Precise and Efficient Control in Diffusion Transformer

Diffusion Transformers (DiTs) have demonstrated exceptional capabilities in text-to-image synthesis. However, in the domain of controllable text-to-image generation using DiTs, most existing methods still rely on the ControlNet paradigm…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-15 · Shanyuan Liu, Jian Zhu, Junda Lu, Yue Gong, Liuzhuozheng Li, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, Yuhui Yin

EasyText: Controllable Diffusion Transformer for Multilingual Text Rendering

Generating accurate multilingual text with diffusion models has long been desired but remains challenging. Recent methods have made progress in rendering text in a single language, but rendering arbitrary languages is still an unexplored…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-11 · Runnan Lu, Yuxuan Zhang, Jiaming Liu, Haofan Wang, Yiren Song

DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-15 · Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Hao Luo, Yibing Song, Gao Huang, Fan Wang, Yang You

TC-LoRA: Temporally Modulated Conditional LoRA for Adaptive Diffusion Control

Current controllable diffusion models typically rely on fixed architectures that modify intermediate activations to inject guidance conditioned on a new modality. This approach uses a static conditioning strategy for a dynamic, multi-stage…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-16 · Minkyoung Cho, Ruben Ohana, Christian Jacobsen, Adityan Jothi, Min-Hung Chen, Z. Morley Mao, Ethem Can

OminiControl2: Efficient Conditioning for Diffusion Transformers

Fine-grained control of text-to-image diffusion transformer models (DiT) remains a critical challenge for practical deployment. While recent advances such as OminiControl and others have enabled a controllable generation of diverse control…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-12 · Zhenxiong Tan, Qiaochu Xue, Xingyi Yang, Songhua Liu, Xinchao Wang

FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Despite their remarkable performance, modern Diffusion Transformers are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we…

Machine Learning · Computer Science · 2025-02-28 · Sotiris Anagnostidis, Gregor Bachmann, Yeongmin Kim, Jonas Kohler, Markos Georgopoulos, Artsiom Sanakoyeu, Yuming Du, Albert Pumarola, Ali Thabet, Edgar Schönfeld

FreeControl: Efficient, Training-Free Structural Control via One-Step Attention Extraction

Controlling the spatial and semantic structure of diffusion-generated images remains a challenge. Existing methods like ControlNet rely on handcrafted condition maps and retraining, limiting flexibility and generalization. Inversion-based…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-10 · Jiang Lin, Xinyu Chen, Song Wu, Zhiqiu Zhang, Jizhi Zhang, Ye Wang, Qiang Tang, Qian Wang, Jian Yang, Zili Yi

In-Context LoRA for Diffusion Transformers

Recent research arXiv:2410.15027 has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-06 · Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou

ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices

The Diffusion Transformer (DiT) architecture is the state-of-the-art paradigm for high-fidelity image generation, underpinning models like Stable Diffusion-3 and FLUX.1. However, deploying these models on resource-constrained mobile devices…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-18 · Kunpeng Du, Haizhen Xie, Sen Lu, Lei Yu, Binglei Bao, Huaao Tang, Chuntao Liu, Hao Wu, Yang Zhao, Zhicai Huang, Heyuan Gao, Zhijun Tu, Jie Hu, Xinghao Chen

EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-13 · Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li

Dynamic Diffusion Transformer

Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-10 · Wangbo Zhao, Yizeng Han, Jiasheng Tang, Kai Wang, Yibing Song, Gao Huang, Fan Wang, Yang You

SoftCap: Soft-Budget Control for Diffusion Transformer Acceleration

Diffusion Transformers (DiTs) achieve strong visual quality, but their iterative denoising process requires many costly Transformer evaluations. Training-free acceleration methods reduce this cost by caching, forecasting, or verifying…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-27 · Yuhang Zhang, Junxiang Qiu, Huixia Ben, Zhenhua Tang, Shuo Wang, Yanbin Hao

FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Fine-grained and efficient controllability on video diffusion transformers has raised increasing desires for the applicability. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-06 · Xuanhua He, Quande Liu, Zixuan Ye, Weicai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai

HarmoniCa: Harmonizing Training and Inference for Better Feature Caching in Diffusion Transformer Acceleration

Diffusion Transformers (DiTs) excel in generative tasks but face practical deployment challenges due to high inference costs. Feature caching, which stores and retrieves redundant computations, offers the potential for acceleration.…

Computer Vision and Pattern Recognition · Computer Science · 2025-06-03 · Yushi Huang, Zining Wang, Ruihao Gong, Jing Liu, Xinjie Zhang, Jinyang Guo, Xianglong Liu, Jun Zhang

RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer

Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-29 · Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You

FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

Diffusion transformers (DiT) have become the de facto choice for generating high-quality images and videos, largely due to their scalability, which enables the construction of larger models for enhanced performance. However, the increased…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-02 · Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, Luming Liang

ECNet: Effective Controllable Text-to-Image Diffusion Models

The conditional text-to-image diffusion models have garnered significant attention in recent years. However, the precision of these models is often compromised mainly for two reasons, ambiguous condition input and inadequate condition…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-28 · Sicheng Li, Keqiang Sun, Zhixin Lai, Xiaoshi Wu, Feng Qiu, Haoran Xie, Kazunori Miyata, Hongsheng Li