Related papers: LayoutBERT: Masked Language Layout Model for Objec…

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them.…

Computer Vision and Pattern Recognition · Computer Science 2020-01-24 Di Qi, Lin Su, Jia Song, Edward Cui, Taroon Bharti, Arun Sacheti

Context-Aware Synthesis and Placement of Object Instances

Learning to insert an object instance into an image in a semantically coherent manner is a challenging and interesting problem. Solving it requires (a) determining a location to place an object in the scene and (b) determining its…

Computer Vision and Pattern Recognition · Computer Science 2018-12-10 Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz

TopNet: Transformer-based Object Placement Network for Image Compositing

We investigate the problem of automatically placing an object into a background image for image compositing. Given a background image and a segmented object, the goal is to train a model to predict plausible placements (location and scale)…

Computer Vision and Pattern Recognition · Computer Science 2023-04-10 Sijie Zhu, Zhe Lin, Scott Cohen, Jason Kuen, Zhifei Zhang, Chen Chen

Insert Anything: Image Insertion via In-Context Editing in DiT

This work presents Insert Anything, a unified framework for reference-based image insertion that seamlessly integrates objects from reference images into target scenes under flexible, user-specified control guidance. Instead of training…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Wensong Song, Hong Jiang, Zongxing Yang, Ruijie Quan, Yi Yang

LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang

Thinking Outside the BBox: Unconstrained Generative Object Compositing

Compositing an object into an image involves multiple non-trivial sub-tasks such as object placement and scaling, color/lighting harmonization, viewpoint/geometry adjustment, and shadow/reflection generation. Recent generative image…

Computer Vision and Pattern Recognition · Computer Science 2024-09-12 Gemma Canet Tarrés, Zhe Lin, Zhifei Zhang, Jianming Zhang, Yizhi Song, Dan Ruta, Andrew Gilbert, John Collomosse, Soo Ye Kim

Making Images Real Again: A Comprehensive Survey on Deep Image Composition

As a common image editing operation, image composition (object insertion) aims to combine the foreground from one image and another background image, to produce a composite image. However, there are many issues that could make the composite…

Computer Vision and Pattern Recognition · Computer Science 2026-03-20 Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image…

Computer Vision and Pattern Recognition · Computer Science 2020-06-23 Zhicheng Huang, Zhaoyang Zeng, Bei Liu, Dongmei Fu, Jianlong Fu

DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing

Recently, how to achieve precise image editing has attracted increasing attention, especially given the remarkable success of text-to-image generation models. To unify various spatial-aware image editing abilities into one framework, we…

Computer Vision and Pattern Recognition · Computer Science 2024-03-22 Yueru Jia, Yuhui Yuan, Aosong Cheng, Chuke Wang, Ji Li, Huizhu Jia, Shanghang Zhang

VisualBERT: A Simple and Performant Baseline for Vision and Language

We propose VisualBERT, a simple and flexible framework for modeling a broad range of vision-and-language tasks. VisualBERT consists of a stack of Transformer layers that implicitly align elements of an input text and regions in an…

Computer Vision and Pattern Recognition · Computer Science 2019-08-12 Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, Kai-Wei Chang

Improving Editability in Image Generation with Layer-wise Memory

Most real-world image editing tasks require multiple sequential edits to achieve desired results. Current editing approaches, primarily designed for single-object modifications, struggle with sequential editing: especially with maintaining…

Computer Vision and Pattern Recognition · Computer Science 2025-05-05 Daneul Kim, Jaeah Lee, Jaesik Park

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

Image editing has advanced significantly with the introduction of text-conditioned diffusion models. Despite this progress, seamlessly adding objects to images based on textual instructions without requiring user-provided input masks…

Computer Vision and Pattern Recognition · Computer Science 2025-03-21 Navve Wasserman, Noam Rotstein, Roy Ganz, Ron Kimmel

EraseDraw: Learning to Draw Step-by-Step via Erasing Objects from Images

Creative processes such as painting often involve creating different components of an image one by one. Can we build a computational model to perform this task? Prior works often fail by making global changes to the image, inserting objects…

Computer Vision and Pattern Recognition · Computer Science 2024-12-25 Alper Canberk, Maksym Bondarenko, Ege Ozguroglu, Ruoshi Liu, Carl Vondrick

BOOTPLACE: Bootstrapped Object Placement with Detection Transformers

In this paper, we tackle the copy-paste image-to-image composition problem with a focus on object placement learning. Prior methods have leveraged generative models to reduce the reliance on dense supervision. However, this often limits…

Computer Vision and Pattern Recognition · Computer Science 2025-03-31 Hang Zhou, Xinxin Zuo, Rui Ma, Li Cheng

UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal…

Computer Vision and Pattern Recognition · Computer Science 2020-07-21 Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, Jingjing Liu

Synthesizing Training Data for Object Detection in Indoor Scenes

Detection of objects in cluttered indoor environments is one of the key enabling functionalities for service robots. The best performing object detection approaches in computer vision exploit deep Convolutional Neural Networks (CNN) to…

Computer Vision and Pattern Recognition · Computer Science 2017-09-11 Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, Jana Kosecka

Text2Layer: Layered Image Generation using Latent Diffusion Model

Layer compositing is one of the most popular image editing workflows among both amateurs and professionals. Motivated by the success of diffusion models, we explore layer compositing from a layered image generation perspective. Instead of…

Computer Vision and Pattern Recognition · Computer Science 2023-07-20 Xinyang Zhang, Wentian Zhao, Xin Lu, Jeff Chien

Instance Segmentation based Semantic Matting for Compositing Applications

Image compositing is a key step in film making and image editing that aims to segment a foreground object and combine it with a new background. Automatic image compositing can be done easily in a studio using chroma-keying when the…

Computer Vision and Pattern Recognition · Computer Science 2019-04-12 Guanqing Hu, James J. Clark

OPA: Object Placement Assessment Dataset

Image composition aims to generate a realistic composite image by inserting an object from one image into another background image, where the placement (e.g., location, size, occlusion) of the inserted object may be unreasonable, which would…

Computer Vision and Pattern Recognition · Computer Science 2022-06-22 Liu Liu, Zhenchen Liu, Bo Zhang, Jiangtong Li, Li Niu, Qingyang Liu, Liqing Zhang

ObjectMate: A Recurrence Prior for Object Insertion and Subject-Driven Generation

This paper introduces a tuning-free method for both object insertion and subject-driven generation. The task involves composing an object, given multiple views, into a scene specified by either an image or text. Existing methods struggle to…

Computer Vision and Pattern Recognition · Computer Science 2024-12-12 Daniel Winter, Asaf Shul, Matan Cohen, Dana Berman, Yael Pritch, Alex Rav-Acha, Yedid Hoshen