Related papers: M3-Embedding: Multi-Linguality, Multi-Functionalit…

200 papers

Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual…

Information Retrieval · Computer Science 2025-12-04 Adithya S Kolavi, Vyoman Jain

Transformer-based large language models (LLMs) rely on contextual embeddings which generate different (continuous) representations for the same token depending on its surrounding context. Nonetheless, words and tokens typically have a…

Computation and Language · Computer Science 2025-07-10 Qitong Wang, Mohammed J. Zaki, Georgios Kollias, Vasileios Kalantzis

In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision…

Computation and Language · Computer Science 2026-01-21 Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin

Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to…

Computation and Language · Computer Science 2026-04-10 Yuntao Gui, James Cheng

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs'…

Computation and Language · Computer Science 2025-06-12 Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, Jingren Zhou

Deep language models learning a hierarchical representation proved to be a powerful tool for natural language processing, text mining and information retrieval. However, representations that perform well for retrieval must capture semantic…

Information Retrieval · Computer Science 2019-05-24 Tolgahan Cakaloglu, Xiaowei Xu

We introduce EmbeddingGemma, a new lightweight, open text embedding model based on the Gemma 3 language model family. Our innovative training recipe strategically captures knowledge from larger models via encoder-decoder initialization and…

Multimodal embedding models aim to map heterogeneous inputs, such as text, images, videos, and audio, into a shared semantic space. However, existing methods and benchmarks remain largely limited to partial modality coverage, making it…

Information Retrieval · Computer Science 2026-04-28 Haohang Huang, Xuan Lu, Mingyi Su, Xuan Zhang, Ziyan Jiang, Ping Nie, Kai Zou, Tomas Pfister, Wenhu Chen, Wei Zhang, Xiaoyu Shen, Rui Meng

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when…

Computation and Language · Computer Science 2026-05-20 Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

We present an easy and efficient method to extend existing sentence embedding models to new languages. This allows creating multilingual versions from previously monolingual models. The training is based on the idea that a translated…

Computation and Language · Computer Science 2020-10-06 Nils Reimers, Iryna Gurevych

In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal…

Information Retrieval · Computer Science 2024-03-22 Yang Bai, Anthony Colas, Christan Grant, Daisy Zhe Wang

LLMs confront inherent limitations in terms of their knowledge, memory, and action. Retrieval augmentation stands as a vital mechanism to address these limitations, which brings in useful information from external sources to augment the…

Information Retrieval · Computer Science 2026-01-06 Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve…

Computer Vision and Pattern Recognition · Computer Science 2026-05-25 Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and…

Computation and Language · Computer Science 2024-07-29 Yihao Ding, Lorenzo Vaiani, Caren Han, Jean Lee, Paolo Garza, Josiah Poon, Luca Cagliero

Multilingual semantic search is the task of retrieving relevant contents to a query expressed in different language combinations. This requires a better semantic understanding of the user's intent and its contextual meaning. Multilingual…

Computation and Language · Computer Science 2023-09-18 Meryem M'hamdi, Jonathan May, Franck Dernoncourt, Trung Bui, Seunghyun Yoon

State-of-the-art retrieval models typically address a straightforward search scenario, in which retrieval tasks are fixed (e.g., finding a passage to answer a specific question) and only a single modality is supported for both queries and…

Computation and Language · Computer Science 2025-02-25 Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping

Massively multilingual sentence representation models, e.g., LASER, SBERT-distill, and LaBSE, help significantly improve cross-lingual downstream tasks. However, the use of a large amount of data or inefficient model architectures results…

Computation and Language · Computer Science 2024-05-31 Zhuoyuan Mao, Chenhui Chu, Sadao Kurohashi

Learning a distinct representation for each sense of an ambiguous word could lead to more powerful and fine-grained models of vector-space representations. Yet while 'multi-sense' methods have been proposed and tested on artificial…

Computation and Language · Computer Science 2015-11-25 Jiwei Li, Dan Jurafsky

This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided,…

Computation and Language · Computer Science 2024-02-09 Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, Furu Wei

We introduce the Granite Embedding models, a family of encoder-based embedding models designed for retrieval tasks, spanning dense-retrieval and sparse retrieval architectures, with both English and Multilingual capabilities. This report…
