Related papers: Adaptive Chunking: Optimizing Chunking-Method Sele…

200 papers

Chunking information is a key step in Retrieval Augmented Generation (RAG). Current research primarily centers on paragraph-level chunking. This approach treats all texts as equal and neglects the information contained in the structure of…

Computation and Language · Computer Science 2024-03-19 Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, Renyu Li

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning…

Computation and Language · Computer Science 2026-03-10 Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of RAG systems, often lacks effective evaluation tools. This…

Computation and Language · Computer Science 2025-10-10 Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun

Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static…

Information Retrieval · Computer Science 2026-02-27 Xuechen Zhang, Koustava Goswami, Samet Oymak, Jiasi Chen, Nedim Lipka

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a…

Computation and Language · Computer Science 2026-05-27 Mudit Rastogi

Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality.…

Information Retrieval · Computer Science 2026-03-26 Samuel Taiwo, Mohd Amaluddin Yusoff

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and…

Machine Learning · Computer Science 2025-07-15 Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed

While Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for boosting large language models (LLMs) in knowledge-intensive tasks, it often overlooks the crucial aspect of text chunking within its workflow. This paper…

Computation and Language · Computer Science 2025-05-22 Jihao Zhao, Zhiyuan Ji, Yuchen Feng, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies…

Computation and Language · Computer Science 2026-01-21 Sofia Bennani, Charles Moslonka

Retrieval-Augmented Generation (RAG) systems using large language models (LLMs) often generate inaccurate responses due to the retrieval of irrelevant or loosely related information. Existing methods, which operate at the document level,…

Computation and Language · Computer Science 2025-04-24 Ishneet Sukhvinder Singh, Ritvik Aggarwal, Ibrahim Allahverdiyev, Muhammad Taha, Aslihan Akalin, Kevin Zhu, Sean O'Brien

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and…

Information Retrieval · Computer Science 2025-05-30 Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and…

Computation and Language · Computer Science 2025-07-15 Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen

Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully…

Information Retrieval · Computer Science 2026-04-08 Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation…

Computation and Language · Computer Science 2025-05-27 Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

Retrieval-augmented generation (RAG) pipelines for code completion rely on chunking to segment source files into retrievable units, yet chunking strategies are typically adopted without empirical justification, and practitioner…

Software Engineering · Computer Science 2026-05-07 Xinjian Wu, Jingzhi Gong, Gunel Jahangirova, Jie Zhang

Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of…

Information Retrieval · Computer Science 2025-04-29 Carlo Merola, Jaspinder Singh

Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design…

Information Retrieval · Computer Science 2025-06-24 Mahmoud Amiri, Thomas Bocklitz

The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes…

Computation and Language · Computer Science 2026-03-13 Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin, Hongyan Liu

Retrieval-Augmented Generation (RAG) has proven effective in open-domain question answering. However, the chunking process, which is essential to this pipeline, often receives insufficient attention relative to retrieval and synthesis…

Computation and Language · Computer Science 2025-01-20 Zuhong Liu, Charles-Elie Simon, Fabien Caspani

Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well…

Information Retrieval · Computer Science 2026-02-17 Christos Koutsiaris