Related papers: Think Then Embed: Generative Context Improves Mult…

Embed-RL: Reinforcement Learning for Reasoning-Driven Multimodal Embeddings

Leveraging Multimodal Large Language Models (MLLMs) has become pivotal for advancing Universal Multimodal Embeddings (UME) in addressing diverse cross-modal tasks. Recent studies demonstrate that incorporating generative Chain-of-Thought…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-13 · Haonan Jiang, Yuji Wang, Yongjie Zhu, Xin Lu, Wenyu Qin, Meng Wang, Pengfei Wan, Yansong Tang

Reasoning Guided Embeddings: Leveraging MLLM Reasoning for Improved Multimodal Retrieval

Multimodal embeddings are widely used in downstream tasks such as multimodal retrieval, enabling alignment of interleaved modalities in a shared representation space. While recent studies show that Multimodal Large Language Models (MLLMs)…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-21 · Chunxu Liu, Jiyuan Yang, Ruopeng Gao, Yuhan Zhu, Feng Zhu, Rui Zhao, Limin Wang

UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from reasoning-driven generation…

Machine Learning · Computer Science · 2026-03-03 · Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with…

Artificial Intelligence · Computer Science · 2026-05-19 · Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-15 · Longxiang Zhang, Weilong Dai, Guanghao Zhang, Hao Jiang, Pipei Huang

TRACE: Task-Adaptive Reasoning and Representation Learning for Universal Multimodal Retrieval

Universal Multimodal Retrieval requires unified embedding models capable of interpreting diverse user intents, ranging from simple keywords to complex compositional instructions. While Multimodal Large Language Models (MLLMs) possess strong…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-05 · Xiangzhao Hao, Shijie Wang, Tianyu Yang, Tianyue Wang, Haiyun Guo, Jinqiao Wang

PLUME: Latent Reasoning Based Universal Multimodal Embedding

Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-21 · Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang

MMEmb-R1: Reasoning-Enhanced Multimodal Embedding with Pair-Aware Selection and Adaptive Control

MLLMs have been successfully applied to multimodal embedding tasks, yet their generative reasoning capabilities remain underutilized. Directly incorporating chain-of-thought reasoning into embedding learning introduces two fundamental…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-08 · Yuchi Wang, Haiyang Yu, Weikang Bian, Jiefeng Long, Xiao Liang, Chao Feng, Hongsheng Li

Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

Multimodal Large Language Models (MLLMs) have emerged as a promising foundation for universal multimodal embeddings. Recent studies have shown that reasoning-driven generative multimodal embeddings can outperform discriminative embeddings…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-29 · Peixi Wu, Ke Mei, Feipeng Ma, Bosong Chai, Zhibin Lan, Chenxi Zhao, Shannan Yan, Jie Chen, Zhangchi Hu, Yansong Peng, Bo Lin, Junjie Zhou, Dacheng Yin, Tianyi Wang, Fengyun Rao, Jing Lyu, Hebei Li, Xiaoyan Sun

Think-Then-Generate: Reasoning-Aware Text-to-Image Diffusion with LLM Encoders

Recent progress in text-to-image (T2I) diffusion models (DMs) has enabled high-quality visual synthesis from diverse textual prompts. Yet, most existing T2I DMs, even those equipped with large language model (LLM)-based text encoders,…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-16 · Siqi Kou, Jiachun Jin, Zetong Zhou, Ye Ma, Yugang Wang, Quan Chen, Peng Jiang, Xiao Yang, Jun Zhu, Kai Yu, Zhijie Deng

GEM: Empowering LLM for both Embedding Generation and Language Understanding

Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG),…

Computation and Language · Computer Science · 2025-06-06 · Caojin Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Jason Liu, Serena Li, Lizhu Zhang, Xiangjun Fan

Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-06 · Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

Reason to Contrast: A Cascaded Multimodal Retrieval Framework

Traditional multimodal retrieval systems rely primarily on bi-encoder architectures, where performance is closely tied to embedding dimensionality. Recent work, Think-Then-Embed (TTE), shows that incorporating multimodal reasoning to elicit…

Information Retrieval · Computer Science · 2026-03-02 · Xuanming Cui, Hong-You Chen, Hao Yu, Hao Yuan, Zihao Wang, Shlok Kumar Mishra, Hanchao Yu, Yonghuan Yang, Jun Xiao, Ser-Nam Lim, Jianpeng Cheng, Qi Guo, Xiangjun Fan

Exploring Reasoning-Infused Text Embedding with Large Language Models for Zero-Shot Dense Retrieval

Transformer-based models such as BERT and E5 have significantly advanced text embedding by capturing rich contextual representations. However, many complex real-world queries require sophisticated reasoning to retrieve relevant documents…

Computation and Language · Computer Science · 2025-09-03 · Yuxiang Liu, Tian Wang, Gourab Kundu, Tianyu Cao, Guang Cheng, Zhen Ge, Jianshu Chen, Qingjun Cui, Trishul Chilimbi

Search-R3: Unifying Reasoning and Embedding in Large Language Models

Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to…

Computation and Language · Computer Science · 2026-04-10 · Yuntao Gui, James Cheng

Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual…

Computation and Language · Computer Science · 2026-03-26 · Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi, Franck Dernoncourt, Saeed Hassanpour, Soroush Vosoughi

Encode, Think, Decode: Scaling test-time reasoning with recursive latent thoughts

Most efforts to improve the reasoning capabilities of large language models (LLMs) involve either scaling the number of parameters and the size of training data, or scaling inference computation by letting models generate complex chains of…

Machine Learning · Computer Science · 2025-10-10 · Yeskendir Koishekenov, Aldo Lipani, Nicola Cancedda

MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning

Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific…

Artificial Intelligence · Computer Science · 2025-07-03 · Guiyao Tie, Xueyang Zhou, Tianhe Gu, Ruihang Zhang, Chaoran Hu, Sizhe Zhang, Mengqu Sun, Yan Zhang, Pan Zhou, Lichao Sun

ReMatch: Boosting Representation through Matching for Multimodal Retrieval

We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-27 · Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science · 2025-07-09 · Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang