
Related papers: Think Then Embed: Generative Context Improves Mult…

200 papers

Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought…

Computer Vision and Pattern Recognition · Computer Science 2026-03-13 Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, Yansong Tang

Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs)…

Computer Vision and Pattern Recognition · Computer Science 2025-11-21 Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, Limin Wang

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from reasoning-driven generation…

Machine Learning · Computer Science 2026-03-03 Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with…

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in…

Computer Vision and Pattern Recognition · Computer Science 2026-05-15 Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong…

Computer Vision and Pattern Recognition · Computer Science 2026-03-05 Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, Jinqiao Wang

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental…

Computer Vision and Pattern Recognition · Computer Science 2026-04-08 Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings…

Computer Vision and Pattern Recognition · Computer Science 2026-05-29 Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders,…

Computer Vision and Pattern Recognition · Computer Science 2026-01-16 Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng

Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG),…

Computation and Language · Computer Science 2025-06-06 Caojin Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Jason Liu, Serena Li, Lizhu Zhang, Xiangjun Fan

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

Traditional multimodal retrieval systems rely primarily on bi-encoder architectures, where performance is closely tied to embedding dimensionality. Recent work, Think-Then-Embed (TTE), shows that incorporating multimodal reasoning to elicit…

Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents…

Computation and Language · Computer Science 2025-09-03 Yuxiang Liu, Tian Wang, Gourab Kundu, Tianyu Cao, Guang Cheng, Zhen Ge, Jianshu Chen, Qingjun Cui, Trishul Chilimbi

Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to…

Computation and Language · Computer Science 2026-04-10 Yuntao Gui, James Cheng

General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual…

Computation and Language · Computer Science 2026-03-26 Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi, Franck Dernoncourt, Saeed Hassanpour, Soroush Vosoughi

Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of…

Machine Learning · Computer Science 2025-10-10 Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific…

Artificial Intelligence · Computer Science 2025-07-03 Guiyao Tie, Xueyang Zhou, Tianhe Gu, Ruihang Zhang, Chaoran Hu, Sizhe Zhang, Mengqu Sun, Yan Zhang, Pan Zhou, Lichao Sun

We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang