Related papers: Pixel-Aligned Multi-View Generation with Depth Gui…

ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-30 · Lukas Höllein, Aljaž Božič, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner

3D-aware Image Generation using 2D Diffusion Models

In this paper, we introduce a novel 3D-aware image generation method that leverages 2D diffusion models. We formulate the 3D-aware image generation task as multiview 2D image set generation, and further to a sequential…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-03 · Jianfeng Xiang, Jiaolong Yang, Binbin Huang, Xin Tong

Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

We introduce a diffusion-based framework that performs aligned novel view image and geometry generation via a warping-and-inpainting methodology. Unlike prior methods that require dense posed images or pose-embedded generative models…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-09 · Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyung Kim, Seungryong Kim, Jin-Hwa Kim

Rethinking and Improving Natural Language Generation with Layer-Wise Multi-View Decoding

In sequence-to-sequence learning, e.g., natural language generation, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last…

Computation and Language · Computer Science · 2022-08-30 · Fenglin Liu, Xuancheng Ren, Guangxiang Zhao, Chenyu You, Xuewei Ma, Xian Wu, Xu Sun

VUGEN: Visual Understanding priors for GENeration

Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-09 · Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek

Both Semantics and Reconstruction Matter: Making Representation Encoders Ready for Text-to-Image Generation and Editing

Modern Latent Diffusion Models (LDMs) typically operate in low-level Variational Autoencoder (VAE) latent spaces that are primarily optimized for pixel-level reconstruction. To unify vision generation and understanding, a burgeoning trend…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-22 · Shilong Zhang, He Zhang, Zhifei Zhang, Chongjian Ge, Shuchen Xue, Shaoteng Liu, Mengwei Ren, Soo Ye Kim, Yuqian Zhou, Qing Liu, Daniil Pakhomov, Kai Zhang, Zhe Lin, Ping Luo

Novel View Synthesis with Pixel-Space Diffusion Models

Synthesizing a novel view from a single input image is a challenging task. Traditionally, this task was approached by estimating scene depth, warping, and inpainting, with machine learning models enabling parts of the pipeline. More…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-13 · Noam Elata, Bahjat Kawar, Yaron Ostrovsky-Berman, Miriam Farber, Ron Sokolovsky

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. However, one critical limitation of these models is the low fidelity of generated images with respect to the text description, such as…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-11 · Qiucheng Wu, Yujian Liu, Handong Zhao, Trung Bui, Zhe Lin, Yang Zhang, Shiyu Chang

Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think

The field of advanced text-to-image generation is witnessing the emergence of unified frameworks that integrate powerful text encoders, such as CLIP and T5, with Diffusion Transformer backbones. Although there have been efforts to control…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-28 · Liang Chen, Shuai Bai, Wenhao Chai, Weichu Xie, Haozhe Zhao, Leon Vinci, Junyang Lin, Baobao Chang

3DEnhancer: Consistent Multi-View Diffusion for 3D Enhancement

Despite advances in neural rendering, due to the scarcity of high-quality 3D datasets and the inherent limitations of multi-view diffusion models, view synthesis and 3D model generation are restricted to low resolutions with suboptimal…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-30 · Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy

Detector Guidance for Multi-Object Text-to-Image Generation

Diffusion models have demonstrated impressive performance in text-to-image generation. They utilize a text encoder and cross-attention blocks to infuse textual information into images at a pixel level. However, their capability to generate…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-06 · Luping Liu, Zijian Zhang, Yi Ren, Rongjie Huang, Xiang Yin, Zhou Zhao

MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

Multi-view generation with camera pose control and prompt-based customization are both essential elements for achieving controllable generative models. However, existing multi-view generation models do not support customization with…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-12 · Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

Despite having tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D…

Computer Vision and Pattern Recognition · Computer Science · 2024-09-12 · Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei

GeoMVD: Geometry-Enhanced Multi-View Generation Model Based on Geometric Information Extraction

Multi-view image generation holds significant application value in computer vision, particularly in domains like 3D reconstruction, virtual reality, and augmented reality. Most existing methods, which rely on extending single images, face…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-20 · Jiaqi Wu, Yaosen Chen, Shuyuan Zhu

Dual Diffusion for Unified Image Generation and Understanding

Diffusion models have gained tremendous success in text-to-image generation, yet still lag behind with visual understanding tasks, an area dominated by autoregressive vision-language models. We propose a large-scale and fully end-to-end…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-03 · Zijie Li, Henry Li, Yichun Shi, Amir Barati Farimani, Yuval Kluger, Linjie Yang, Peng Wang

MEAT: Multiview Diffusion Model for Human Generation on Megapixels with Mesh Attention

Multiview diffusion models have shown considerable success in image-to-3D generation for general objects. However, when applied to human data, existing methods have yet to deliver promising results, largely due to the challenges of scaling…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-12 · Yuhan Wang, Fangzhou Hong, Shuai Yang, Liming Jiang, Wayne Wu, Chen Change Loy

Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-28 · Peng Li, Yuan Liu, Xiaoxiao Long, Feihu Zhang, Cheng Lin, Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo

Envision3D: One Image to 3D with Anchor Views Interpolation

We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image. Recent methods that extract 3D content from multi-view images generated by diffusion models show great potential. However, it is…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-15 · Yatian Pang, Tanghui Jia, Yujun Shi, Zhenyu Tang, Junwu Zhang, Xinhua Cheng, Xing Zhou, Francis E. H. Tay, Li Yuan

Dense Text-to-Image Generation with Attention Modulation

Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-25 · Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, Jun-Yan Zhu

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

Multi-view image diffusion models have significantly advanced open-domain 3D object generation. However, most existing models rely on 2D network architectures that lack inherent 3D biases, resulting in compromised geometric consistency. To…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-21 · Hansheng Chen, Bokui Shen, Yulin Liu, Ruoxi Shi, Linqi Zhou, Connor Z. Lin, Jiayuan Gu, Hao Su, Gordon Wetzstein, Leonidas Guibas