Related papers: How Does Chunking Affect Retrieval-Augmented Code …

Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to…

Computation and Language · Computer Science · 2026-03-27 · Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero

Evaluating Chunking Strategies For Retrieval-Augmented Generation in Oil and Gas Enterprise Documents

Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality.…

Information Retrieval · Computer Science · 2026-03-26 · Samuel Taiwo, Mohd Amaluddin Yusoff

A Systematic Analysis of Chunking Strategies for Reliable Question Answering

We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies…

Computation and Language · Computer Science · 2026-01-21 · Sofia Bennani, Charles Moslonka

cAST: Enhancing Code Retrieval-Augmented Generation with Structural Chunking via Abstract Syntax Tree

Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the…

Software Engineering · Computer Science · 2025-10-06 · Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu

Optimizing Retrieval-Augmented Generation: Analysis of Hyperparameter Impact on Performance and Efficiency

Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters…

Machine Learning · Computer Science · 2025-05-14 · Adel Ammar, Anis Koubaa, Omer Nacar, Wadii Boulila

Chunk Twice, Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design…

Information Retrieval · Computer Science · 2025-06-24 · Mahmoud Amiri, Thomas Bocklitz

Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation

Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent…

Information Retrieval · Computer Science · 2025-12-02 · Aparajitha Allamraju, Maitreya Prafulla Chitale, Hiranmai Sri Adibhatla, Rahul Mishra, Manish Shrivastava

Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

Standard Retrieval-Augmented Generation (RAG) chunking methods often create excessive redundancy, increasing storage costs and slowing retrieval. This study explores chunk filtering strategies, such as semantic, topic-based, and…

Computation and Language · Computer Science · 2026-04-28 · Daria Berdyugina, Anaëlle Cohen, Yohann Rioual

Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and…

Information Retrieval · Computer Science · 2025-05-30 · Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

Impact-driven Context Filtering For Cross-file Code Completion

Retrieval-augmented generation (RAG) has recently demonstrated considerable potential for repository-level code completion, as it integrates cross-file knowledge with in-file preceding code to provide comprehensive contexts for generation.…

Software Engineering · Computer Science · 2025-08-11 · Yanzhou Li, Shangqing Liu, Kangjie Chen, Tianwei Zhang, Yang Liu

MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation…

Computation and Language · Computer Science · 2025-05-27 · Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods

The performance of Retrieval-Augmented Generation (RAG) systems in information retrieval is significantly influenced by the characteristics of the documents being processed. In this study, the structured nature of textbooks, the conciseness…

Information Retrieval · Computer Science · 2024-09-23 · Esmaeil Narimissa, David Raithel

Is Semantic Chunking Worth the Computational Cost?

Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the…

Computation and Language · Computer Science · 2024-10-18 · Renyi Qu, Ruixuan Tu, Forrest Bao

HiChunk: Evaluating and Enhancing Retrieval-Augmented Generation with Hierarchical Chunking

Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking as an important part of RAG system often lacks effective evaluation tools. This…

Computation and Language · Computer Science · 2025-10-10 · Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun

Financial Report Chunking for Effective Retrieval Augmented Generation

Chunking information is a key step in Retrieval Augmented Generation (RAG). Current research primarily centers on paragraph-level chunking. This approach treats all texts as equal and neglects the information contained in the structure of…

Computation and Language · Computer Science · 2024-03-19 · Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, Renyu Li

A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat

Code completion, a crucial task in software engineering that enhances developer productivity, has seen substantial improvements with the rapid advancement of large language models (LLMs). In recent years, retrieval-augmented generation…

Software Engineering · Computer Science · 2025-07-25 · Zezhou Yang, Ting Peng, Cuiyun Gao, Chaozheng Wang, Hailiang Huang, Yuetang Deng

Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully…

Information Retrieval · Computer Science · 2026-04-08 · Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed

Practical Code RAG at Scale: Task-Aware Retrieval Design Choices under Compute Budgets

We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations…

Machine Learning · Computer Science · 2025-10-24 · Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov

RAG or Fine-tuning? A Comparative Study on LCMs-based Code Completion in Industry

Code completion, a crucial practice in industrial settings, helps developers improve programming efficiency by automatically suggesting code snippets during development. With the emergence of Large Code Models (LCMs), this field has…

Software Engineering · Computer Science · 2025-05-22 · Chaozheng Wang, Zezhou Yang, Shuzheng Gao, Cuiyun Gao, Ting Peng, Hailiang Huang, Yuetang Deng, Michael Lyu

A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning…

Computation and Language · Computer Science · 2026-03-10 · Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn