Related papers: Late Chunking: Contextual Chunk Embeddings Using L…

Beyond Chunk-Then-Embed: A Comprehensive Taxonomy and Evaluation of Document Chunking Strategies for Information Retrieval

Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods…

Information Retrieval · Computer Science · 2026-02-20 · Yongjie Zhou, Shuai Wang, Bevan Koopman, Guido Zuccon

Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of…

Information Retrieval · Computer Science · 2025-04-29 · Carlo Merola, Jaspinder Singh

Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings

A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same documents independently, often overlooking crucial contextual information from the rest of the document that could…

Information Retrieval · Computer Science · 2025-06-09 · Max Conti, Manuel Faysse, Gautier Viaud, Antoine Bosselut, Céline Hudelot, Pierre Colombo

Dynamic Chunking and Selection for Reading Comprehension of Ultra-Long Context in Large Language Models

Large language models (LLMs) often struggle to accurately read and comprehend extremely long texts. Current methods for improvement typically rely on splitting long contexts into fixed-length chunks. However, fixed truncation risks…

Computation and Language · Computer Science · 2025-06-04 · Boheng Sheng, Jiacheng Yao, Meicong Zhang, Guoxiu He

A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning…

Computation and Language · Computer Science · 2026-03-10 · Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible…

Computation and Language · Computer Science · 2024-02-20 · Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu

Recurrent Chunking Mechanisms for Long-Text Machine Reading Comprehension

In this paper, we study machine reading comprehension (MRC) on long texts, where a model takes as inputs a lengthy document and a question and then extracts a text span from the document as an answer. State-of-the-art models tend to use a…

Computation and Language · Computer Science · 2020-05-20 · Hongyu Gong, Yelong Shen, Dian Yu, Jianshu Chen, Dong Yu

Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Yibo Yan, Mingdong Ou, Yi Cao, Jiahao Huo, Xin Zou, Shuliang Liu, James Kwok, Xuming Hu

On Debiasing Text Embeddings Through Context Injection

Current advances in Natural Language Processing (NLP) have made it increasingly feasible to build applications leveraging textual data. Generally, the core of these applications relies on having a good semantic representation of text into…

Computation and Language · Computer Science · 2024-10-21 · Thomas Uriot

An Analysis on Matching Mechanisms and Token Pruning for Late-interaction Models

With the development of pre-trained language models, the dense retrieval models have become promising alternatives to the traditional retrieval models that rely on exact match and sparse bag-of-words representations. Different from most…

Information Retrieval · Computer Science · 2024-03-21 · Qi Liu, Gang Guo, Jiaxin Mao, Zhicheng Dou, Ji-Rong Wen, Hao Jiang, Xinyu Zhang, Zhao Cao

Contextual Document Embeddings

Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are…

Computation and Language · Computer Science · 2024-11-11 · John X. Morris, Alexander M. Rush

END: Early Noise Dropping for Efficient and Effective Context Denoising

Large Language Models (LLMs) have demonstrated remarkable performance across a wide range of natural language processing tasks. However, they are often distracted by irrelevant or noisy context in input sequences that degrades output…

Computation and Language · Computer Science · 2026-05-21 · Hongye Jin, Pei Chen, Jingfeng Yang, Zhengyang Wang, Fangran Mo, Jinghan Zhang, Meng Jiang, Yifan Gao, Binxuan Huang, Xinyang Zhang, Zheng Li, Tianyi Liu, Huasheng Li, Bing Yin

Extensible Embedding: A Flexible Multipler For LLM's Context Length

Large language models (LLMs) call for extension of context to handle many critical applications. However, the existing approaches are prone to expensive costs and inferior quality of context extension. In this work, we propose Extensible…

Computation and Language · Computer Science · 2024-02-20 · Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and…

Information Retrieval · Computer Science · 2025-05-30 · Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

Contextualized Query Embeddings for Conversational Search

This paper describes a compact and effective model for low-latency passage retrieval in conversational search based on learned dense representations. Prior to our work, the state-of-the-art approach uses a multi-stage pipeline comprising…

Information Retrieval · Computer Science · 2021-11-30 · Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin

Sentence Compression as Deletion with Contextual Embeddings

Sentence compression is the task of creating a shorter version of an input sentence while keeping important information. In this paper, we extend the task of compression by deletion with the use of contextual embeddings. Different from…

Information Retrieval · Computer Science · 2020-06-08 · Minh-Tien Nguyen, Bui Cong Minh, Dung Tien Le, Le Thai Linh

A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens

Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval, semantic textual similarity, etc. In this work, we show an interesting finding: when feeding a text into the LLM-based…

Computation and Language · Computer Science · 2025-07-08 · Zhijie Nie, Richong Zhang, Zhanyu Wu

Efficient Long Context Fine-tuning with Chunk Flow

Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-07-14 · Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin

Empirical Evaluation of Embedding Models in the Context of Text Classification in Document Review in Construction Delay Disputes

Text embeddings are numerical representations of text data, where words, phrases, or entire documents are converted into vectors of real numbers. These embeddings capture semantic meanings and relationships between text elements in a…

Information Retrieval · Computer Science · 2025-01-20 · Fusheng Wei, Robert Neary, Han Qin, Qiang Mao, Jianping Zhang

Relative Positioning Based Code Chunking Method For Rich Context Retrieval In Repository Level Code Completion Task With Code Language Model

Code completion can help developers improve efficiency and ease the development lifecycle. Although code completion is available in modern integrated development environments (IDEs), research lacks in determining what makes a good context…

Software Engineering · Computer Science · 2025-10-13 · Imranur Rahman, Md Rayhanur Rahman