Related papers: SceneMotifCoder: Example-driven Visual Program Lea…

ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Auto-Encoding Progressive Generative Adversarial Networks For 3D Multi Object Scenes

3D multi object generative models allow us to synthesize a large range of novel 3D multi object scenes and also identify objects, shapes, layouts and their positions. But multi object scenes are difficult to create because of the dataset…

Computer Vision and Pattern Recognition · Computer Science 2019-03-11 Vedant Singh, Manan Oza, Himanshu Vaghela, Pratik Kanani

Text To 3D Object Generation For Scalable Room Assembly

Modern machine learning models for scene understanding, such as depth estimation and object tracking, rely on large, high-quality datasets that mimic real-world deployment scenarios. To address data scarcity, we propose an end-to-end system…

Computer Vision and Pattern Recognition · Computer Science 2025-04-15 Sonia Laguna, Alberto Garcia-Garcia, Marie-Julie Rakotosaona, Stylianos Moschoglou, Leonhard Helminger, Sergio Orts-Escolano

Structured Generative Models for Scene Understanding

This position paper argues for the use of structured generative models (SGMs) for the understanding of static scenes. This requires the reconstruction of a 3D scene from an input image (or a set of multi-view images), whereby the…

Computer Vision and Pattern Recognition · Computer Science 2024-12-16 Christopher K. I. Williams

Multiview Scene Graph

A proper scene representation is central to the pursuit of spatial intelligence where agents can robustly reconstruct and efficiently understand 3D scenes. A scene representation is either metric, such as landmark maps in 3D reconstruction,…

Computer Vision and Pattern Recognition · Computer Science 2024-11-21 Juexiao Zhang, Gao Zhu, Sihang Li, Xinhao Liu, Haorui Song, Xinran Tang, Chen Feng

Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene…

Computer Vision and Pattern Recognition · Computer Science 2025-05-06 Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li

Multi-modal Generation via Cross-Modal In-Context Learning

In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Amandeep Kumar, Muzammal Naseer, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

3DFroMLLM: 3D Prototype Generation only from Pretrained Multimodal LLMs

Recent Multi-Modal Large Language Models (MLLMs) have demonstrated strong capabilities in learning joint representations from text and images. However, their spatial reasoning remains limited. We introduce 3DFroMLLM, a novel framework that…

Computer Vision and Pattern Recognition · Computer Science 2025-08-13 Noor Ahmed, Cameron Braunstein, Steffen Eger, Eddy Ilg

SceneWiz3D: Towards Text-guided 3D Scene Composition

We are witnessing significant breakthroughs in the technology for generating 3D objects from text. Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Qihang Zhang, Chaoyang Wang, Aliaksandr Siarohin, Peiye Zhuang, Yinghao Xu, Ceyuan Yang, Dahua Lin, Bolei Zhou, Sergey Tulyakov, Hsin-Ying Lee

LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding

3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. The significant gap between 3D and language modalities makes it a notable challenge to distinguish multiple similar objects through the…

Computer Vision and Pattern Recognition · Computer Science 2025-08-18 Feng Xiao, Hongbin Xu, Guocan Zhao, Wenxiong Kang

MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings

We present MMCORE, a unified framework designed for multimodal image generation and editing. MMCORE leverages a pre-trained Vision-Language Model (VLM) to predict semantic visual embeddings via learnable query tokens, which subsequently…

Computer Vision and Pattern Recognition · Computer Science 2026-04-23 Zijie Li, Yichun Shi, Jingxiang Sun, Ye Wang, Yixuan Huang, Zhiyao Guo, Xiaochen Lian, Peihao Zhu, Yu Tian, Zhonghua Zhai, Peng Wang

SceneLinker: Compositional 3D Scene Generation via Semantic Scene Graph from RGB Sequences

We introduce SceneLinker, a novel framework that generates compositional 3D scenes via semantic scene graph from RGB sequences. To adaptively experience Mixed Reality (MR) content based on each user's space, it is essential to generate a 3D…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Seok-Young Kim, Dooyoung Kim, Woojin Cho, Hail Song, Suji Kang, Woontack Woo

FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement

Scene generation with 3D assets presents a complex challenge, requiring both high-level semantic understanding and low-level geometric reasoning. While Multimodal Large Language Models (MLLMs) excel at semantic tasks, their application to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-10 Ian Huang, Yanan Bao, Karen Truong, Howard Zhou, Cordelia Schmid, Leonidas Guibas, Alireza Fathi

Open-Universe Indoor Scene Generation using LLM Program Synthesis and Uncurated Object Databases

We present a system for generating indoor scenes in response to text prompts. The prompts are not limited to a fixed vocabulary of scene descriptions, and the objects in generated scenes are not restricted to a fixed set of object…

Computer Vision and Pattern Recognition · Computer Science 2024-03-18 Rio Aguina-Kang, Maxim Gumin, Do Heon Han, Stewart Morris, Seung Jean Yoo, Aditya Ganeshan, R. Kenny Jones, Qiuhong Anna Wei, Kailiang Fu, Daniel Ritchie

Toward Scene Graph and Layout Guided Complex 3D Scene Generation

Recent advancements in object-centric text-to-3D generation have shown impressive results. However, generating complex 3D scenes remains an open challenge due to the intricate relations between objects. Moreover, existing methods are…

Computer Vision and Pattern Recognition · Computer Science 2024-12-31 Yu-Hsiang Huang, Wei Wang, Sheng-Yu Huang, Yu-Chiang Frank Wang

Multimodal Large Language Models for Multi-Subject In-Context Image Generation

Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities…

Machine Learning · Computer Science 2026-04-10 Yucheng Zhou, Dubing Chen, Huan Zheng, Jianbing Shen

VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models

Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding. However, their ability to generate code from multimodal inputs remains limited. In this work, we introduce VisCodex, a…

Computation and Language · Computer Science 2025-08-14 Lingjie Jiang, Shaohan Huang, Xun Wu, Yixia Li, Dongdong Zhang, Furu Wei

SceneDecorator: Towards Scene-Oriented Story Generation with Scene Planning and Scene Consistency

Recent text-to-image models have revolutionized image generation, but they still struggle with maintaining concept consistency across generated images. While existing works focus on character consistency, they often overlook the crucial…

Computer Vision and Pattern Recognition · Computer Science 2025-10-28 Quanjian Song, Donghao Zhou, Jingyu Lin, Fei Shen, Jiaze Wang, Xiaowei Hu, Cunjian Chen, Pheng-Ann Heng

Multilingual Multimodal Software Developer for Code Generation

The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To…

Computation and Language · Computer Science 2025-07-14 Linzheng Chai, Jian Yang, Shukai Liu, Wei Zhang, Liran Wang, Ke Jin, Tao Sun, Congnan Liu, Chenchen Zhang, Hualei Zhu, Jiaheng Liu, Xianjie Wu, Ge Zhang, Tianyu Liu, Zhoujun Li

CC3D: Layout-Conditioned Generation of Compositional 3D Scenes

In this work, we introduce CC3D, a conditional generative model that synthesizes complex 3D scenes conditioned on 2D semantic scene layouts, trained using single-view images. Different from most existing 3D GANs that limit their…

Computer Vision and Pattern Recognition · Computer Science 2023-09-12 Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Xingguang Yan, Gordon Wetzstein, Leonidas Guibas, Andrea Tagliasacchi