Related papers: MACRO: Advancing Multi-Reference Image Generation …

MultiRef: Controllable Image Generation with Multiple Visual References

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs --…

Computer Vision and Pattern Recognition · Computer Science 2025-08-27 Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna

MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; that is, to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

Factuality Matters: When Image Generation and Editing Meet Structured Visuals

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li

Benchmarking and Analyzing Generative Data for Visual Recognition

Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu

MileBench: Benchmarking MLLMs in Long Context

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing…

Computation and Language · Computer Science 2024-05-16 Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

UniREditBench: A Unified Reasoning-based Image Editing Benchmark

Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang

Image Captioning with Multi-Context Synthetic Data

Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g. diffusion models and large language models) have excelled in producing high-quality images and text. This…

Computer Vision and Pattern Recognition · Computer Science 2023-12-20 Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

Feedback-guided Data Synthesis for Imbalanced Classification

Current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static…

Computer Vision and Pattern Recognition · Computer Science 2024-09-11 Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

Multimodal Large Language Models for Multi-Subject In-Context Image Generation

Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities…

Machine Learning · Computer Science 2026-04-10 Yucheng Zhou, Dubing Chen, Huan Zheng, Jianbing Shen

MICON-Bench: Benchmarking and Enhancing Multi-Image Context Image Generation in Unified Multimodal Models

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu

MICo-150K: A Comprehensive Dataset Advancing Multi-Image Composition

In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To…

Computer Vision and Pattern Recognition · Computer Science 2026-04-29 Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Kai Cui, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang

Image Synthesis under Limited Data: A Survey and Taxonomy

Deep generative models, which target reproducing the given data distribution to produce novel samples, have made unprecedented advancements in recent years. Their technical breakthroughs have enabled unparalleled quality in the synthesis of…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Mengping Yang, Zhe Wang

MR$^2$-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval

Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic…

Information Retrieval · Computer Science 2025-10-01 Junjie Zhou, Ze Liu, Lei Xiong, Jin-Ge Yao, Yueze Wang, Shitao Xiao, Fenfen Lin, Miguel Hu Chen, Zhicheng Dou, Siqi Bao, Defu Lian, Yongping Xiong, Zheng Liu

ContextRef: Evaluating Referenceless Metrics For Image Description Generation

Referenceless metrics (e.g., CLIPScore) use pretrained vision–language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with…

Computation and Language · Computer Science 2023-09-22 Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics,…

Machine Learning · Computer Science 2021-11-11 Paul Pu Liang, Yiwei Lyu, Xiang Fan, Zetian Wu, Yun Cheng, Jason Wu, Leslie Chen, Peter Wu, Michelle A. Lee, Yuke Zhu, Ruslan Salakhutdinov, Louis-Philippe Morency

OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang

OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing

The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang

RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To…

Computer Vision and Pattern Recognition · Computer Science 2025-06-05 Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral

MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang