Related papers: MACRO: Advancing Multi-Reference Image Generation …

200 papers

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs --…

Computer Vision and Pattern Recognition · Computer Science 2025-08-27 Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna

Recent text-to-image generation models have acquired the ability of multi-reference generation and editing; that is, to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

While modern visual generation models excel at creating aesthetically pleasing natural images, they struggle with producing or editing structured visuals like charts, diagrams, and mathematical figures, which demand composition planning,…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Le Zhuo, Songhao Han, Yuandong Pu, Boxiang Qiu, Sayak Paul, Yue Liao, Yihao Liu, Jie Shao, Xi Chen, Si Liu, Hongsheng Li

Advancements in large pre-trained generative models have expanded their potential as effective data generators in visual recognition. This work delves into the impact of generative images, primarily comparing paradigms that harness external…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Bo Li, Haotian Liu, Liangyu Chen, Yong Jae Lee, Chunyuan Li, Ziwei Liu

Despite the advancements and impressive performance of Multimodal Large Language Models (MLLMs) on benchmarks, their effectiveness in real-world, long-context, and multi-image tasks is unclear due to the benchmarks' limited scope. Existing…

Computation and Language · Computer Science 2024-05-16 Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang

Recent advances in multi-modal generative models have driven substantial improvements in image editing. However, current generative models still struggle with handling diverse and complex image editing tasks that require implicit reasoning,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Feng Han, Yibin Wang, Chenglin Li, Zheming Liang, Dianyi Wang, Yang Jiao, Zhipeng Wei, Chao Gong, Cheng Jin, Jingjing Chen, Jiaqi Wang

Image captioning requires numerous annotated image-text pairs, resulting in substantial annotation costs. Recently, large models (e.g. diffusion models and large language models) have excelled in producing high-quality images and text. This…

Computer Vision and Pattern Recognition · Computer Science 2023-12-20 Feipeng Ma, Yizhou Zhou, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

Current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static…

Computer Vision and Pattern Recognition · Computer Science 2024-09-11 Reyhane Askari Hemmat, Mohammad Pezeshki, Florian Bordes, Michal Drozdzal, Adriana Romero-Soriano

Recent advances in text-to-image (T2I) generation have enabled visually coherent image synthesis from descriptions, but generating images containing multiple given subjects remains challenging. As the number of reference identities…

Machine Learning · Computer Science 2026-04-10 Yucheng Zhou, Dubing Chen, Huan Zheng, Jianbing Shen

Recent advancements in Unified Multimodal Models (UMMs) have enabled remarkable image understanding and generation capabilities. However, while models like Gemini-2.5-Flash-Image show emerging abilities to reason over multiple related…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Mingrui Wu, Hang Liu, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu

In controllable image generation, synthesizing coherent and consistent images from multiple reference inputs, i.e., Multi-Image Composition (MICo), remains a challenging problem, partly hindered by the lack of high-quality training data. To…

Computer Vision and Pattern Recognition · Computer Science 2026-04-29 Xinyu Wei, Kangrui Cen, Hongyang Wei, Zhen Guo, Kai Cui, Bairui Li, Zeqing Wang, Jinrui Zhang, Lei Zhang

Deep generative models, which target reproducing the given data distribution to produce novel samples, have made unprecedented advancements in recent years. Their technical breakthroughs have enabled unparalleled quality in the synthesis of…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Mengping Yang, Zhe Wang

Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic…

Information Retrieval · Computer Science 2025-10-01 Junjie Zhou, Ze Liu, Lei Xiong, Jin-Ge Yao, Yueze Wang, Shitao Xiao, Fenfen Lin, Miguel Hu Chen, Zhicheng Dou, Siqi Bao, Defu Lian, Yongping Xiong, Zheng Liu

Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with…

Computation and Language · Computer Science 2023-09-22 Elisa Kreiss, Eric Zelikman, Christopher Potts, Nick Haber

Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. It is a challenging yet crucial area with numerous real-world applications in multimedia, affective computing, robotics,…

Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang

The performance of unified multimodal models for image generation and editing is fundamentally constrained by the quality and comprehensiveness of their training data. While existing datasets have covered basic tasks like style transfer and…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Zhihong Chen, Xuehai Bai, Yang Shi, Chaoyou Fu, Huanyu Zhang, Haotian Wang, Xiaoyan Sun, Zhang Zhang, Liang Wang, Yuanxing Zhang, Pengfei Wan, Yi-Fan Zhang

Despite recent advances in inversion and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but significantly struggle when applied to complex scenes containing multiple entities. To…

Computer Vision and Pattern Recognition · Computer Science 2025-06-05 Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral

Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang