Related papers: Object-level Visual Prompts for Compositional Imag…

VSC: Visual Search Compositional Text-to-Image Diffusion Model

Text-to-image diffusion models have shown impressive capabilities in generating realistic visuals from natural-language prompts, yet they often struggle with accurately binding attributes to corresponding objects, especially in prompts…

Computer Vision and Pattern Recognition · Computer Science 2025-05-05 Do Huu Dat, Nam Hyeonu, Po-Yuan Mao, Tae-Hyun Oh

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

Diffusion-based generative models have significantly advanced text-to-image generation but encounter challenges when processing lengthy and intricate text prompts describing complex scenes with multiple objects. While excelling in…

Computer Vision and Pattern Recognition · Computer Science 2024-02-27 Hanan Gani, Shariq Farooq Bhat, Muzammal Naseer, Salman Khan, Peter Wonka

Localizing Object-level Shape Variations with Text-to-Image Diffusion Models

Text-to-image models give rise to workflows which often begin with an exploration step, where users sift through a large collection of generated images. The global nature of the text-to-image generation process prevents users from narrowing…

Computer Vision and Pattern Recognition · Computer Science 2023-08-15 Or Patashnik, Daniel Garibi, Idan Azuri, Hadar Averbuch-Elor, Daniel Cohen-Or

Sketch-Guided Scene Image Generation

Text-to-image models are showcasing the impressive ability to create high-quality and diverse generative images. Nevertheless, the transition from freehand sketches to complex scene images remains challenging using diffusion models. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-07-10 Tianyu Zhang, Xiaoxuan Xie, Xusheng Du, Haoran Xie

ObjectComposer: Consistent Generation of Multiple Objects Without Fine-tuning

Recent text-to-image generative models can generate high-fidelity images from text prompts. However, these models struggle to consistently generate the same objects in different contexts with the same appearance. Consistent object…

Computer Vision and Pattern Recognition · Computer Science 2023-10-12 Alec Helbling, Evan Montoya, Duen Horng Chau

Visual Style Prompting with Swapping Self-Attention

In the evolving domain of text-to-image generation, diffusion models have emerged as powerful tools in content creation. Despite their remarkable capability, existing models still face challenges in achieving controlled generation with a…

Computer Vision and Pattern Recognition · Computer Science 2024-02-22 Jaeseok Jeong, Junho Kim, Yunjey Choi, Gayoung Lee, Youngjung Uh

Leveraging Text-to-Image Diffusion Models for Unsupervised Visual Object Tracking

Unsupervised visual object tracking is a challenging task that requires following arbitrary targets in videos without training on ground-truth annotations. Despite considerable progress, existing state-of-the-art unsupervised trackers often…

Computer Vision and Pattern Recognition · Computer Science 2026-05-27 Zhengbo Zhang, Zhigang Tu, Junsong Yuan, De Wen Soh, Bo Du

Composing Concepts from Images and Videos via Concept-prompt Binding

Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining…

Computer Vision and Pattern Recognition · Computer Science 2026-03-17 Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao

ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a…

Computer Vision and Pattern Recognition · Computer Science 2025-10-17 Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman

Compositional Image Synthesis with Inference-Time Scaling

Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn

Multi-modal Generation via Cross-Modal In-Context Learning

In this work, we study the problem of generating novel images from complex multimodal prompt sequences. While existing methods achieve promising results for text-to-image generation, they often struggle to capture fine-grained details from…

Computer Vision and Pattern Recognition · Computer Science 2024-05-29 Amandeep Kumar, Muzammal Naseer, Sanath Narayan, Rao Muhammad Anwer, Salman Khan, Hisham Cholakkal

Obtaining Favorable Layouts for Multiple Object Generation

Large-scale text-to-image models that can generate high-quality and diverse images based on textual prompts have shown remarkable success. These models aim ultimately to create complex scenes, and addressing the challenge of multi-subject…

Computer Vision and Pattern Recognition · Computer Science 2024-05-03 Barak Battash, Amit Rozner, Lior Wolf, Ofir Lindenbaum

Canvas-to-Image: Compositional Image Generation with Multimodal Controls

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and…

Computer Vision and Pattern Recognition · Computer Science 2026-02-04 Yijia Xu, Zihao Wang, Jinshi Cui

Generating Compositional Scenes via Text-to-image RGBA Instance Generation

Text-to-image diffusion generative models can generate high quality images at the cost of tedious prompt engineering. Controllability can be improved by introducing layout conditioning, however existing methods lack layout editing ability…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Alessandro Fontanella, Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Sarah Parisot

Descriminative-Generative Custom Tokens for Vision-Language Models

This paper explores the possibility of learning custom tokens for representing new concepts in Vision-Language Models (VLMs). Our aim is to learn tokens that can be effective for both discriminative and generative tasks while composing well…

Computer Vision and Pattern Recognition · Computer Science 2025-02-18 Pramuditha Perera, Matthew Trager, Luca Zancato, Alessandro Achille, Stefano Soatto

VLM-Guided Adaptive Negative Prompting for Creative Generation

Creative generation is the synthesis of new, surprising, and valuable samples that reflect user intent yet cannot be envisioned in advance. This task aims to extend human imagination, enabling the discovery of visual concepts that exist in…

Graphics · Computer Science 2025-10-14 Shelly Golan, Yotam Nitzan, Zongze Wu, Or Patashnik

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Maxwell Mbabilla Aladago, AJ Piergiovanni

Visual Question Answering based on Local-Scene-Aware Referring Expression Generation

Visual question answering requires a deep understanding of both images and natural language. However, most methods mainly focus on visual concepts, such as the relationships between various objects. The limited use of object categories…

Computer Vision and Pattern Recognition · Computer Science 2021-01-25 Jung-Jun Kim, Dong-Gyu Lee, Jialin Wu, Hong-Gyu Jung, Seong-Whan Lee

Textual Localization: Decomposing Multi-concept Images for Subject-Driven Text-to-Image Generation

Subject-driven text-to-image diffusion models empower users to tailor the model to new concepts absent in the pre-training dataset using a few sample images. However, prevalent subject-driven models primarily rely on single-concept input…

Computer Vision and Pattern Recognition · Computer Science 2024-02-16 Junjie Shentu, Matthew Watson, Noura Al Moubayed