Related papers: Contextual Document Embeddings

Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings

A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could…

Information Retrieval · Computer Science 2025-06-09 Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, Pierre Colombo

Distilling Semantic Concept Embeddings from Contrastively Fine-Tuned Language Models

Learning vectors that capture the meaning of concepts remains a fundamental challenge. Somewhat surprisingly, perhaps, pre-trained language models have thus far only enabled modest improvements to the quality of such concept embeddings.…

Computation and Language · Computer Science 2023-05-18 Na Li, Hanane Kteich, Zied Bouraoui, Steven Schockaert

CODER: An efficient framework for improving retrieval through COntextual Document Embedding Reranking

Contrastive learning has been the dominant approach to training dense retrieval models. In this work, we investigate the impact of ranking context - an often overlooked aspect of learning dense retrieval models. In particular, we examine…

Information Retrieval · Computer Science 2023-10-24 George Zerveas, Navid Rekabsaz, Daniel Cohen, Carsten Eickhoff

Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes

Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. These embeddings capture semantic meanings and relationships between text elements in a…

Information Retrieval · Computer Science 2025-01-20 Fusheng Wei, Robert Neary, Han Qin, Qiang Mao, Jianping Zhang

The Medium Is Not the Message: Deconfounding Document Embeddings via Linear Concept Erasure

Embedding-based similarity metrics between text sequences can be influenced not just by the content dimensions we most care about, but can also be biased by spurious attributes like the text's source or language. These document confounders…

Computation and Language · Computer Science 2025-09-25 Yu Fan, Yang Tian, Shauli Ravfogel, Mrinmaya Sachan, Elliott Ash, Alexander Hoyle

More Robust Dense Retrieval with Contrastive Dual Learning

Dense retrieval conducts text retrieval in the embedding space and has shown many advantages compared to sparse retrieval. Existing dense retrievers optimize representations of queries and documents with contrastive training and map them to…

Information Retrieval · Computer Science 2021-07-19 Yizhi Li, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu

Text Embeddings for Retrieval From a Large Knowledge Base

Text embedding representing natural language documents in a semantic vector space can be used for document retrieval using nearest neighbor lookup. In order to study the feasibility of neural models specialized for retrieval in a…

Information Retrieval · Computer Science 2019-05-03 Tolgahan Cakaloglu, Christian Szegedy, Xiaowei Xu

MCSE: Multimodal Contrastive Learning of Sentence Embeddings

Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal…

Computation and Language · Computer Science 2022-04-26 Miaoran Zhang, Marius Mosbach, David Ifeoluwa Adelani, Michael A. Hedderich, Dietrich Klakow

Sentence Compression as Deletion with Contextual Embeddings

Sentence compression is the task of creating a shorter version of an input sentence while keeping important information. In this paper, we extend the task of compression by deletion with the use of contextual embeddings. Different from…

Information Retrieval · Computer Science 2020-06-08 Minh-Tien Nguyen, Bui Cong Minh, Dung Tien Le, Le Thai Linh

On Debiasing Text Embeddings Through Context Injection

Current advances in Natural Language Processing (NLP) have made it increasingly feasible to build applications leveraging textual data. Generally, the core of these applications relies on having a good semantic representation of text into…

Computation and Language · Computer Science 2024-10-21 Thomas Uriot

A Dual Embedding Space Model for Document Ranking

A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words…

Information Retrieval · Computer Science 2016-02-04 Bhaskar Mitra, Eric Nalisnick, Nick Craswell, Rich Caruana

Learning Contextualised Cross-lingual Word Embeddings and Alignments for Extremely Low-Resource Languages Using Parallel Corpora

We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus (e.g. a few hundred sentence pairs). Our method obtains word embeddings via an LSTM encoder-decoder model that…

Computation and Language · Computer Science 2021-10-22 Takashi Wada, Tomoharu Iwata, Yuji Matsumoto, Timothy Baldwin, Jey Han Lau

Learning Conceptual-Contextual Embeddings for Medical Text

External knowledge is often useful for natural language understanding tasks. We introduce a contextual text representation model called Conceptual-Contextual (CC) embeddings, which incorporates structured knowledge into text…

Computation and Language · Computer Science 2020-03-13 Xiao Zhang, Dejing Dou, Ji Wu

Repurposing Language Models into Embedding Models: Finding the Compute-Optimal Recipe

Text embeddings are essential for many tasks, such as document retrieval, clustering, and semantic similarity assessment. In this paper, we study how to contrastively train text embedding models in a compute-optimal fashion, given a suite…

Machine Learning · Computer Science 2024-11-22 Alicja Ziarko, Albert Q. Jiang, Bartosz Piotrowski, Wenda Li, Mateja Jamnik, Piotr Miłoś

Debiasing Pre-trained Contextualised Embeddings

In comparison to the numerous debiasing methods proposed for the static non-contextualised word embeddings, the discriminative biases in contextualised embeddings have received relatively little attention. We propose a fine-tuning method…

Computation and Language · Computer Science 2021-01-26 Masahiro Kaneko, Danushka Bollegala

Context-Aware Embeddings for Automatic Art Analysis

Automatic art analysis aims to classify and retrieve artistic representations from a collection of images by using computer vision and machine learning techniques. In this work, we propose to enhance visual representations from neural…

Computer Vision and Pattern Recognition · Computer Science 2019-04-11 Noa Garcia, Benjamin Renoust, Yuta Nakashima

Cross-Lingual Contextual Word Embeddings Mapping With Multi-Sense Words In Mind

Recent work in cross-lingual contextual word embedding learning cannot handle multi-sense words well. In this work, we explore the characteristics of contextual word embeddings and show the link between contextual word embeddings and word…

Computation and Language · Computer Science 2019-09-20 Zheng Zhang, Ruiqing Yin, Jun Zhu, Pierre Zweigenbaum

Contextual Embeddings: When Are They Worth It?

We study the settings for which deep contextual embeddings (e.g., BERT) give large improvements in performance relative to classic pretrained embeddings (e.g., GloVe), and an even simpler baseline (random word embeddings), focusing on the…

Computation and Language · Computer Science 2020-05-20 Simran Arora, Avner May, Jian Zhang, Christopher Ré

A Multi-Resolution Word Embedding for Document Retrieval from Large Unstructured Knowledge Bases

Deep language models learning a hierarchical representation proved to be a powerful tool for natural language processing, text mining and information retrieval. However, representations that perform well for retrieval must capture semantic…

Information Retrieval · Computer Science 2019-05-24 Tolgahan Cakaloglu, Xiaowei Xu

Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be over-compressed in the embeddings. Consequently,…

Computation and Language · Computer Science 2025-07-08 Michael Günther, Isabelle Mohr, Daniel James Williams, Bo Wang, Han Xiao