Related papers: How Does Chunking Affect Retrieval-Augmented Code …

200 papers

The effectiveness of Retrieval-Augmented Generation (RAG) is highly dependent on how documents are chunked, that is, segmented into smaller units for indexing and retrieval. Yet, commonly used "one-size-fits-all" approaches often fail to…

Computation and Language · Computer Science 2026-03-27 Paulo Roberto de Moura Júnior, Jean Lelong, Annabelle Blangero

Retrieval-Augmented Generation (RAG) has emerged as a framework to address the constraints of Large Language Models (LLMs). Yet, its effectiveness fundamentally hinges on document chunking - an often-overlooked determinant of its quality.…

Information Retrieval · Computer Science 2026-03-26 Samuel Taiwo, Mohd Amaluddin Yusoff

We study how document chunking choices impact the reliability of Retrieval-Augmented Generation (RAG) systems in industry. While practice often relies on heuristics, our end-to-end evaluation on Natural Questions systematically varies…

Computation and Language · Computer Science 2026-01-21 Sofia Bennani, Charles Moslonka

Retrieval-Augmented Generation (RAG) has become essential for large-scale code generation, grounding predictions in external code corpora to improve factuality. However, a critical yet underexplored aspect of RAG pipelines is chunking -- the…

Software Engineering · Computer Science 2025-10-06 Yilin Zhang, Xinran Zhao, Zora Zhiruo Wang, Chenyang Yang, Jiayi Wei, Tongshuang Wu

Large language models achieve high task performance yet often hallucinate or rely on outdated knowledge. Retrieval-augmented generation (RAG) addresses these gaps by coupling generation with external search. We analyse how hyperparameters…

Machine Learning · Computer Science 2025-05-14 Adel Ammar, Anis Koubaa, Omer Nacar, Wadii Boulila

Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design…

Information Retrieval · Computer Science 2025-06-24 Mahmoud Amiri, Thomas Bocklitz

Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent…

Information Retrieval · Computer Science 2025-12-02 Aparajitha Allamraju, Maitreya Prafulla Chitale, Hiranmai Sri Adibhatla, Rahul Mishra, Manish Shrivastava

Standard Retrieval-Augmented Generation (RAG) chunking methods often create excessive redundancy, increasing storage costs and slowing retrieval. This study explores chunk filtering strategies, such as semantic, topic-based, and…

Computation and Language · Computer Science 2026-04-28 Daria Berdyugina, Anaëlle Cohen, Yohann Rioual

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and…

Information Retrieval · Computer Science 2025-05-30 Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

Retrieval-augmented generation (RAG) has recently demonstrated considerable potential for repository-level code completion, as it integrates cross-file knowledge with in-file preceding code to provide comprehensive contexts for generation.…

Software Engineering · Computer Science 2025-08-11 Yanzhou Li, Shangqing Liu, Kangjie Chen, Tianwei Zhang, Yang Liu

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper initially introduces a dual-metric evaluation…

Computation and Language · Computer Science 2025-05-27 Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

The performance of Retrieval-Augmented Generation (RAG) systems in information retrieval is significantly influenced by the characteristics of the documents being processed. In this study, the structured nature of textbooks, the conciseness…

Information Retrieval · Computer Science 2024-09-23 Esmaeil Narimissa, David Raithel

Recent advances in Retrieval-Augmented Generation (RAG) systems have popularized semantic chunking, which aims to improve retrieval performance by dividing documents into semantically coherent segments. Despite its growing adoption, the…

Computation and Language · Computer Science 2024-10-18 Renyi Qu, Ruixuan Tu, Forrest Bao

Retrieval-Augmented Generation (RAG) enhances the response capabilities of language models by integrating external knowledge sources. However, document chunking, an important part of a RAG system, often lacks effective evaluation tools. This…

Computation and Language · Computer Science 2025-10-10 Wensheng Lu, Keyu Chen, Ruizhi Qiao, Xing Sun

Chunking information is a key step in Retrieval Augmented Generation (RAG). Current research primarily centers on paragraph-level chunking. This approach treats all texts as equal and neglects the information contained in the structure of…

Computation and Language · Computer Science 2024-03-19 Antonio Jimeno Yepes, Yao You, Jan Milczek, Sebastian Laverde, Renyu Li

Code completion, a crucial task in software engineering that enhances developer productivity, has seen substantial improvements with the rapid advancement of large language models (LLMs). In recent years, retrieval-augmented generation…

Software Engineering · Computer Science 2025-07-25 Zezhou Yang, Ting Peng, Cuiyun Gao, Chaozheng Wang, Hailiang Huang, Yuetang Deng

Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully…

Information Retrieval · Computer Science 2026-04-08 Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed

We study retrieval design for code-focused generation tasks under realistic compute budgets. Using two complementary tasks from Long Code Arena -- code completion and bug localization -- we systematically compare retrieval configurations…

Machine Learning · Computer Science 2025-10-24 Timur Galimzyanov, Olga Kolomyttseva, Egor Bogomolov

Code completion, a crucial practice in industrial settings, helps developers improve programming efficiency by automatically suggesting code snippets during development. With the emergence of Large Code Models (LCMs), this field has…

Software Engineering · Computer Science 2025-05-22 Chaozheng Wang, Zezhou Yang, Shuzheng Gao, Cuiyun Gao, Ting Peng, Hailiang Huang, Yuetang Deng, Michael Lyu

We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning…

Computation and Language · Computer Science 2026-03-10 Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn