Related papers: Diffusion Transformers with Representation Autoenc…

DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high…

Computer Vision and Pattern Recognition · Computer Science 2026-01-14 Dongxu Liu, Jiahui Zhu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao

Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Representation Autoencoders (RAEs) have shown distinct advantages in diffusion modeling on ImageNet by training in high-dimensional semantic latent spaces. In this work, we investigate whether this framework can scale to large-scale,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-23 Shengbang Tong, Boyang Zheng, Ziteng Wang, Bingda Tang, Nanye Ma, Ellis Brown, Jihan Yang, Rob Fergus, Yann LeCun, Saining Xie

One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation

Visual generative models (e.g., diffusion models) typically operate in compressed latent spaces to balance training efficiency and sample quality. In parallel, there has been growing interest in leveraging high-quality pre-trained visual…

Computer Vision and Pattern Recognition · Computer Science 2025-12-17 Yuan Gao, Chen Chen, Tianrong Chen, Jiatao Gu

RAE-AR: Taming Autoregressive Models with Representation Autoencoders

The latent space of generative modeling is long dominated by the VAE encoder. The latents from the pretrained representation encoders (e.g., DINO, SigLIP, MAE) are previously considered inappropriate for generative modeling. Recently, RAE…

Artificial Intelligence · Computer Science 2026-04-03 Hu Yu, Hang Xu, Jie Huang, Zeyue Xue, Haoyang Huang, Nan Duan, Feng Zhao

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

Latent diffusion models with Transformer architectures excel at generating high-fidelity images. However, recent studies reveal an optimization dilemma in this two-stage design: while increasing the per-token feature dimension in visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-11 Jingfeng Yao, Bin Yang, Xinggang Wang

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a new…

Machine Learning · Computer Science 2025-01-22 Seyedmorteza Sadat, Jakob Buhmann, Derek Bradley, Otmar Hilliges, Romann M. Weber

Exploring Representation-Aligned Latent Space for Better Generation

Generative models serve as powerful tools for modeling the real world, with mainstream diffusion models, particularly those based on the latent diffusion model paradigm, achieving remarkable progress across various tasks, such as image and…

Machine Learning · Computer Science 2025-02-04 Wanghan Xu, Xiaoyu Yue, Zidong Wang, Yao Teng, Wenlong Zhang, Xihui Liu, Luping Zhou, Wanli Ouyang, Lei Bai

Denoising Vision Transformer Autoencoder with Spectral Self-Regularization

Variational autoencoders (VAEs) typically encode images into a compact latent space, reducing computational cost but introducing an optimization dilemma: a higher-dimensional latent space improves reconstruction fidelity but often hampers…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Xunzhi Xiang, Xingye Tian, Guiyu Zhang, Yabo Chen, Shaofeng Zhang, Xuebo Wang, Xin Tao, Qi Fan

Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

Latent diffusion models for medical image super-resolution universally inherit variational autoencoders designed for natural photographs. We show that this default choice, not the diffusion architecture, is the dominant constraint on…

Computer Vision and Pattern Recognition · Computer Science 2026-04-15 Sebastian Cajas, Ashaba Judith, Rahul Gorijavolu, Sahil Kapadia, Hillary Clinton Kasimbazi, Leo Kinyera, Emmanuel Paul Kwesiga, Sri Sri Jaithra Varma Manthena, Luis Filipe Nakayama, Ninsiima Doreen, Leo Anthony Celi

DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation

Diffusion probabilistic models (DPMs) have shown remarkable results on various image synthesis tasks such as text-to-image generation and image inpainting. However, compared to other generative methods like VAEs and GANs, DPMs lack a…

Computer Vision and Pattern Recognition · Computer Science 2023-07-13 Yipeng Leng, Qiangjuan Huang, Zhiyuan Wang, Yangyang Liu, Haoyu Zhang

RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang

Denoising Diffusion Autoencoders are Unified Self-supervised Learners

Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that…

Computer Vision and Pattern Recognition · Computer Science 2023-08-22 Weilai Xiang, Hongyu Yang, Di Huang, Yunhong Wang

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve reconstruction fidelity or inherit pretrained…

Computer Vision and Pattern Recognition · Computer Science 2026-05-11 Zhengrong Yue, Taihang Hu, Mengting Chen, Haiyu Zhang, Zihao Pan, Tao Liu, Zikang Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Yali Wang

H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network…

Computer Vision and Pattern Recognition · Computer Science 2025-10-02 Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov

Geometric Autoencoder for Diffusion Models

Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These…

Computer Vision and Pattern Recognition · Computer Science 2026-03-13 Hangyu Liu, Jianyong Wang, Yutao Sun

Diffusion Bridge AutoEncoders for Unsupervised Representation Learning

Diffusion-based representation learning has achieved substantial attention due to its promising capabilities in latent representation and sample generation. Recent studies have employed an auxiliary encoder to identify a corresponding…

Machine Learning · Computer Science 2025-03-11 Yeongmin Kim, Kwanghyeon Lee, Minsang Park, Byeonghu Na, Il-Chul Moon

Revisiting Diffusion Autoencoder Training for Image Reconstruction Quality

Diffusion autoencoders (DAEs) are typically formulated as a noise prediction model and trained with a linear-$\beta$ noise schedule that spends much of its sampling steps at high noise levels. Because high noise levels are associated with…

Computer Vision and Pattern Recognition · Computer Science 2025-05-01 Pramook Khungurn, Sukit Seripanitkarn, Phonphrm Thawatdamrongkit, Supasorn Suwajanakorn

On Designing Diffusion Autoencoders for Efficient Generation and Representation Learning

Diffusion autoencoders (DAs) are variants of diffusion generative models that use an input-dependent latent variable to capture representations alongside the diffusion process. These representations, to varying extents, can be used for…

Machine Learning · Computer Science 2025-06-03 Magdalena Proszewska, Nikolay Malkin, N. Siddharth

Latent-Compressed Variational Autoencoder for Video Diffusion Models

Video variational autoencoders (VAEs) used in latent diffusion models typically require a sufficiently large number of latent channels to ensure high-quality video reconstruction. However, recent studies have revealed that an excessive…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Jiarui Guan, Wenshuai Zhao, Zhengtao Zou, Juho Kannala, Arno Solin

Improving Reconstruction of Representation Autoencoder

Recent work leverages Vision Foundation Models as image encoders to boost the generative performance of latent diffusion models (LDMs), as their semantic feature distributions are easy to learn. However, such semantic features often lack…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Siyu Liu, Chujie Qin, Hubery Yin, Qixin Yan, Zheng-Peng Duan, Chen Li, Jing Lyu, Chun-Le Guo, Chongyi Li