Related papers: Adaptive Chunking: Optimizing Chunking-Method Sele…

Financial Report Chunking for Effective Retrieval Augmented Generation

Chunking information is a key step in Retrieval Augmented Generation (RAG). Current research primarily centers on paragraph-level chunking. This approach treats all texts as equal and neglects the information contained in the structure of…

Computation and Language · Computer Science 2024-03-19 Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, Renyu Li

A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning…

Computation and Language · Computer Science 2026-03-10 Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking as an important part of RAG system often lacks effective evaluation tools. This…

Computation and Language · Computer Science 2025-10-10 Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun

SmartChunk Retrieval: Query-Aware Chunk Compression with Planning for Efficient Document RAG

Retrieval-augmented generation (RAG) has strong potential for producing accurate and factual outputs by combining language models (LMs) with evidence retrieved from large text corpora. However, current pipelines are limited by static…

Information Retrieval · Computer Science 2026-02-27 Xuechen Zhang, Koustava Goswami, Samet Oymak, Jiasi Chen, Nedim Lipka

Query-Adaptive Semantic Chunking for Retrieval-Augmented Generation: A Dynamic Strategy with Contextual Window Expansion

Retrieval-Augmented Generation (RAG) systems depend critically on document chunking quality for retrieving relevant context. Fixed chunking segments documents into uniform units irrespective of semantics or user intent, producing a…

Computation and Language · Computer Science 2026-05-27 Mudit Rastogi

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality.…

Information Retrieval · Computer Science 2026-03-26 Samuel Taiwo, Mohd Amaluddin Yusoff

Vision-Guided Chunking Is All You Need: Enhancing RAG with Multimodal Document Understanding

Retrieval-Augmented Generation (RAG) systems have revolutionized information retrieval and question answering, but traditional text-based chunking methods struggle with complex document structures, multi-page tables, embedded figures, and…

Machine Learning · Computer Science 2025-07-15 Vishesh Tripathi, Tanmay Odapally, Indraneel Das, Uday Allu, Biddwan Ahmed

Meta-Chunking: Learning Text Segmentation and Semantic Completion via Logical Perception

While Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for boosting large language models (LLMs) in knowledge-intensive tasks, it often overlooks the crucial aspect of text chunking within its workflow. This paper…

Computation and Language · Computer Science 2025-05-22 Jihao Zhao, Zhiyuan Ji, Yuchen Feng, Pengnian Qi, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

A Systematic Analysis of Chunking Strategies for Reliable Question Answering

We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies…

Computation and Language · Computer Science 2026-01-21 Sofia Bennani, Charles Moslonka

ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems

Retrieval-Augmented Generation (RAG) systems using large language models (LLMs) often generate inaccurate responses due to the retrieval of irrelevant or loosely related information. Existing methods, which operate at the document level,…

Computation and Language · Computer Science 2025-04-24 Ishneet Sukhvinder Singh, Ritvik Aggarwal, Ibrahim Allahverdiyev, Muhammad Taha, Aslihan Akalin, Kevin Zhu, Sean O'Brien

Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and…

Information Retrieval · Computer Science 2025-05-30 Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking

Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and…

Computation and Language · Computer Science 2025-07-15 Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully…

Information Retrieval · Computer Science 2026-04-08 Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed

MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation…

Computation and Language · Computer Science 2025-05-27 Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study

Retrieval-augmented generation (RAG) pipelines for code completion rely on chunking to segment source files into retrievable units, yet chunking strategies are typically adopted without empirical justification, and practitioner…

Software Engineering · Computer Science 2026-05-07 Xinjian Wu, Jingzhi Gong, Gunel Jahangirova, Jie Zhang

Reconstructing Context: Evaluating Advanced Chunking Strategies for Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) has become a transformative approach for enhancing large language models (LLMs) by grounding their outputs in external knowledge sources. Yet, a critical question persists: how can vast volumes of…

Information Retrieval · Computer Science 2025-04-29 Carlo Merola, Jaspinder Singh

Chunk Twice, Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design…

Information Retrieval · Computer Science 2025-06-24 Mahmoud Amiri, Thomas Bocklitz

QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes…

Computation and Language · Computer Science 2026-03-13 Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin, Hongyan Liu

Passage Segmentation of Documents for Extractive Question Answering

Retrieval-Augmented Generation (RAG) has proven effective in open-domain question answering. However, the chunking process, which is essential to this pipeline, often receives insufficient attention relative to retrieval and synthesis…

Computation and Language · Computer Science 2025-01-20 Zuhong Liu, Charles-Elie Simon, Fabien Caspani

Intent-Driven Dynamic Chunking: Segmenting Documents to Reflect Predicted Information Needs

Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well…

Information Retrieval · Computer Science 2026-02-17 Christos Koutsiaris