Related papers: M3-Embedding: Multi-Linguality, Multi-Functionalit…

M3DR: Towards Universal Multilingual Multimodal Document Retrieval

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual…

Information Retrieval · Computer Science 2025-12-04 Adithya S Kolavi, Vyoman Jain

Multi-Sense Embeddings for Language Models and Knowledge Distillation

Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a…

Computation and Language · Computer Science 2025-07-10 Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis

Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision…

Computation and Language · Computer Science 2026-01-21 Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin

Search-R3: Unifying Reasoning and Embedding in Large Language Models

Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to…

Computation and Language · Computer Science 2026-04-10 Yuntao Gui, James Cheng

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs'…

Computation and Language · Computer Science 2025-06-12 Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou

A Multi-Resolution Word Embedding for Document Retrieval from Large Unstructured Knowledge Bases

Deep language models learning a hierarchical representation proved to be a powerful tool for natural language processing, text mining and information retrieval. However, representations that perform well for retrieval must capture semantic…

Information Retrieval · Computer Science 2019-05-24 Tolgahan Cakaloglu, Xiaowei Xu

EmbeddingGemma: Powerful and Lightweight Text Representations

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and…

Computation and Language · Computer Science 2025-11-04 Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, Weiyi Wang, Zhe Li, Gus Martins, Jinhyuk Lee, Mark Sherwood, Juyeong Ji, Renjie Wu, Jingxiao Zheng, Jyotinder Singh, Abheesht Sharma, Divyashree Sreepathihalli, Aashi Jain, Adham Elarabawy, AJ Co, Andreas Doumanoglou, Babak Samari, Ben Hora, Brian Potetz, Dahun Kim, Enrique Alfonseca, Fedor Moiseev, Feng Han, Frank Palma Gomez, Gustavo Hernández Ábrego, Hesen Zhang, Hui Hui, Jay Han, Karan Gill, Ke Chen, Koert Chen, Madhuri Shanbhogue, Michael Boratko, Paul Suganthan, Sai Meher Karthik Duddu, Sandeep Mariserla, Setareh Ariafar, Shanfeng Zhang, Shijie Zhang, Simon Baumgartner, Sonam Goenka, Steve Qiu, Tanmaya Dabral, Trevor Walker, Vikram Rao, Waleed Khawaja, Wenlei Zhou, Xiaoqi Ren, Ye Xia, Yichang Chen, Yi-Ting Chen, Zhe Dong, Zhongli Ding, Francesco Visin, Gaël Liu, Jiageng Zhang, Kathleen Kenealy, Michelle Casbon, Ravin Kumar, Thomas Mesnard, Zach Gleicher, Cormac Brick, Olivier Lacombe, Adam Roberts, Qin Yin, Yunhsuan Sung, Raphael Hoffmann, Tris Warkentin, Armand Joulin, Tom Duerig, Mojtaba Seyedhosseini

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

Multimodal embedding models aim to map heterogeneous inputs, such as text, images, videos, and audio, into a shared semantic space. However, existing methods and benchmarks remain largely limited to partial modality coverage, making it…

Information Retrieval · Computer Science 2026-04-28 Haohang Huang, Xuan Lu, Mingyi Su, Xuan Zhang, Ziyan Jiang, Ping Nie, Kai Zou, Tomas Pfister, Wenhu Chen, Wei Zhang, Xiaoyu Shen, Rui Meng

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when…

Computation and Language · Computer Science 2026-05-20 Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows creating multilingual versions of previously monolingual models. The training is based on the idea that a translated…

Computation and Language · Computer Science 2020-10-06 Nils Reimers, Iryna Gurevych

M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval

In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal…

Information Retrieval · Computer Science 2024-03-22 Yang Bai, Anthony Colas, Christan Grant, Daisy Zhe Wang

A Multi-Task Embedder For Retrieval Augmented LLMs

LLMs confront inherent limitations in terms of their knowledge, memory, and action. Retrieval augmentation stands as a vital mechanism to address these limitations, which brings in useful information from external sources to augment the…

Information Retrieval · Computer Science 2026-01-06 Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie

Multimodal Distribution Matching for Vision-Language Dataset Distillation

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

3MVRD: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding

This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and…

Computation and Language · Computer Science 2024-07-29 Yihao Ding, Lorenzo Vaiani, Caren Han, Jean Lee, Paolo Garza, Josiah Poon, Luca Cagliero

Multilingual Sentence-Level Semantic Search using Meta-Distillation Learning

Multilingual semantic search is the task of retrieving relevant contents to a query expressed in different language combinations. This requires a better semantic understanding of the user's intent and its contextual meaning. Multilingual…

Computation and Language · Computer Science 2023-09-18 Meryem M'hamdi, Jonathan May, Franck Dernoncourt, Trung Bui, Seunghyun Yoon

MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and…

Computation and Language · Computer Science 2025-02-25 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

EMS: Efficient and Effective Massively Multilingual Sentence Embedding Learning

Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, help significantly improve cross-lingual downstream tasks. However, the use of a large amount of data or inefficient model architectures results…

Computation and Language · Computer Science 2024-05-31 Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi

Do Multi-Sense Embeddings Improve Natural Language Understanding?

Learning a distinct representation for each sense of an ambiguous word could lead to more powerful and fine-grained models of vector-space representations. Yet while 'multi-sense' methods have been proposed and tested on artificial…

Computation and Language · Computer Science 2015-11-25 Jiwei Li, Dan Jurafsky

Multilingual E5 Text Embeddings: A Technical Report

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided,…

Computation and Language · Computer Science 2024-02-09 Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

Granite Embedding Models

We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse retrieval architectures, with both English and Multilingual capabilities. This report…

Information Retrieval · Computer Science 2025-02-28 Parul Awasthy, Aashka Trivedi, Yulong Li, Mihaela Bornea, David Cox, Abraham Daniels, Martin Franz, Gabe Goodhart, Bhavani Iyer, Vishwajeet Kumar, Luis Lastras, Scott McCarley, Rudra Murthy, Vignesh P, Sara Rosenthal, Salim Roukos, Jaydeep Sen, Sukriti Sharma, Avirup Sil, Kate Soule, Arafat Sultan, Radu Florian