ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Computer Vision and Pattern Recognition 2024-07-30 v2

Abstract

3D asset generation is getting massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its ability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
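To make the idea of a cross-frame-attention layer more concrete, the following is a minimal single-head sketch in NumPy. It is not the authors' implementation: the function name `cross_frame_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the tensor layout are all assumptions chosen for illustration. The only property it aims to show is that every token of every view attends to the tokens of all views, which is what lets the views exchange information within one denoising pass.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_frame_attention(feats, w_q, w_k, w_v):
    """Illustrative single-head cross-frame attention (hypothetical layout).

    feats: array of shape (n_views, n_tokens, dim), one feature map per view.
    w_q, w_k, w_v: projection matrices of shape (dim, dim).
    Returns an array of shape (n_views, n_tokens, dim).
    """
    n_views, n_tokens, dim = feats.shape

    # Queries stay per view.
    q = feats @ w_q                                   # (V, T, D)

    # Keys and values are pooled over all views, so every query
    # sees the tokens of every view, not only its own.
    all_tokens = feats.reshape(n_views * n_tokens, dim)
    k = all_tokens @ w_k                              # (V*T, D)
    v = all_tokens @ w_v                              # (V*T, D)

    scores = q @ k.T / np.sqrt(dim)                   # (V, T, V*T)
    weights = softmax(scores, axis=-1)
    return weights @ v                                # (V, T, D)


rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4, 8))                    # 3 views, 4 tokens, dim 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = cross_frame_attention(feats, w_q, w_k, w_v)
```

Because the keys and values form an unordered pool over all views, reordering the input views only reorders the output in the same way. Ordinary per-image self-attention would instead restrict each view to its own tokens.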

Cite

@article{arxiv.2403.01807,
  title  = {ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models},
  author = {Lukas Höllein and Aljaž Božič and Norman Müller and David Novotny and Hung-Yu Tseng and Christian Richardt and Michael Zollhöfer and Matthias Nießner},
  journal= {arXiv preprint arXiv:2403.01807},
  year   = {2024}
}

Comments

Accepted to CVPR 2024, project page: https://lukashoel.github.io/ViewDiff/, video: https://www.youtube.com/watch?v=SdjoCqHzMMk, code: https://github.com/facebookresearch/ViewDiff
