Related papers: Multi-Modal Generative Embedding Model

GEM: Empowering LLM for both Embedding Generation and Language Understanding

Large decoder-only language models (LLMs) have achieved remarkable success in generation and reasoning tasks, where they generate text responses given instructions. However, many applications, e.g., retrieval augmented generation (RAG),…

Computation and Language · Computer Science 2025-06-06 Caojin Zhang, Qiang Zhang, Ke Li, Sai Vidyaranya Nuthalapati, Benyu Zhang, Jason Liu, Serena Li, Lizhu Zhang, Xiangjun Fan

Generating Images with Multimodal Language Models

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image…

Computation and Language · Computer Science 2023-10-16 Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and…

Computation and Language · Computer Science 2025-02-25 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

GEM: Generative Supervision Helps Embodied Intelligence

Embodied Vision-Language Models (VLMs) have demonstrated impressive performance and generalization in robotics, particularly within Vision-Language-Action frameworks. However, a significant gap remains between the high-level semantic focus…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Ruowen Zhao, Bangguo Li, Zuyan Liu, Yinan Liang, Junliang Ye, Fangfu Liu, Diankun Wu, Zhengyi Wang, Xumin Yu, Yongming Rao, Han Hu, Jun Zhu

Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond

The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable…

Multimedia · Computer Science 2024-02-19 Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua

CREM: Compression-Driven Representation Enhancement for Multimodal Retrieval and Comprehension

Multimodal Large Language Models (MLLMs) have shown remarkable success in comprehension tasks such as visual description and visual question answering. However, their direct application to embedding-based tasks like retrieval remains…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Lihao Liu, Yan Wang, Biao Yang, Da Li, Jiangxia Cao, Yuxiao Luo, Xiang Chen, Xiangyu Wu, Wei Yuan, Fan Yang, Guiguang Ding, Tingting Gao, Guorui Zhou

Generative Multi-Modal Knowledge Retrieval with Large Language Models

Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when…

Information Retrieval · Computer Science 2024-01-17 Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou

EmbedLLM: Learning Compact Representations of Large Language Models

With hundreds of thousands of language models available on Huggingface today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn…

Computation and Language · Computer Science 2024-10-18 Richard Zhuang, Tianhao Wu, Zhaojin Wen, Andrew Li, Jiantao Jiao, Kannan Ramchandran

Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from reasoning-driven generation…

Machine Learning · Computer Science 2026-03-03 Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su

EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

Benchmarking Retrieval-Augmented Generation in Multi-Modal Contexts

With the rapid advancement of Multi-modal Large Language Models (MLLMs), their capability in understanding both images and text has greatly improved. However, their potential for leveraging multi-modal contextual information in…

Artificial Intelligence · Computer Science 2025-08-08 Zhenghao Liu, Xingsheng Zhu, Tianshuo Zhou, Xinyi Zhang, Xiaoyuan Yi, Yukun Yan, Ge Yu, Maosong Sun

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one…

Computer Vision and Pattern Recognition · Computer Science 2023-08-10 Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data…

Computation and Language · Computer Science 2025-01-16 Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Haofen Wang, Jun Yu, Min Zhang

Unified Generative and Discriminative Training for Multi-modal Large Language Models

In recent times, Vision-Language Models (VLMs) have been trained under two predominant paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) to tackle various complex tasks, yet issues such as hallucinations…

Computer Vision and Pattern Recognition · Computer Science 2024-11-04 Wei Chow, Juncheng Li, Qifan Yu, Kaihang Pan, Hao Fei, Zhiqi Ge, Shuai Yang, Siliang Tang, Hanwang Zhang, Qianru Sun

Vision-Language Semantic Aggregation Leveraging Foundation Model for Generalizable Medical Image Segmentation

Multimodal models have achieved remarkable success in natural image segmentation, yet they often underperform when applied to the medical domain. Through extensive study, we attribute this performance gap to the challenges of multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-09-11 Wenjun Yu, Yinchen Zhou, Jia-Xuan Jiang, Shubin Zeng, Yuee Li, Zhong Wang

CEMTM: Contextual Embedding-based Multimodal Topic Modeling

We introduce CEMTM, a context-enhanced multimodal topic model designed to infer coherent and interpretable topic structures from both short and long documents containing text and images. CEMTM builds on fine-tuned large vision language…

Computation and Language · Computer Science 2025-10-07 Amirhossein Abaskohi, Raymond Li, Chuyuan Li, Shafiq Joty, Giuseppe Carenini

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

ReMatch: Boosting Representation through Matching for Multimodal Retrieval

We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Foundation models update slowly due to resource-intensive training, whereas domain-specific models evolve rapidly between releases. Model merging seeks to combine multiple expert models into a single, more capable model, reducing storage…

Artificial Intelligence · Computer Science 2026-03-04 Yongxian Wei, Runxi Cheng, Weike Jin, Enneng Yang, Li Shen, Lu Hou, Sinan Du, Chun Yuan, Xiaochun Cao, Dacheng Tao