Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP

Software Engineering 2026-03-31 v1 Artificial Intelligence Programming Languages

Abstract

Large Language Model (LLM) coding agents typically explore codebases through repeated file-reading and grep-searching, consuming thousands of tokens per query without structural understanding. We present Codebase-Memory, an open-source system that constructs a persistent, Tree-Sitter-based knowledge graph via the Model Context Protocol (MCP), parsing 66 languages through a multi-phase pipeline with parallel worker pools, call-graph traversal, impact analysis, and community discovery. Evaluated across 31 real-world repositories, Codebase-Memory achieves 83% answer quality versus 92% for a file-exploration agent, at ten times fewer tokens and 2.1 times fewer tool calls. For graph-native queries such as hub detection and caller ranking, it matches or exceeds the explorer on 19 of 31 languages.
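To make the "graph-native queries" mentioned above concrete, the following is a minimal sketch of caller ranking (a simple form of hub detection) and impact analysis over a call graph. It is not the Codebase-Memory implementation: the toy graph `CALLS` and the helper names `build_callers`, `rank_by_callers`, and `impact` are hypothetical and introduced only for illustration. A real system would populate the graph from Tree-Sitter parses rather than a hand-written dictionary.

```python
from collections import defaultdict, deque

# Hypothetical toy call graph (caller -> callees); illustration only.
CALLS = {
    "main": ["parse", "render"],
    "parse": ["tokenize", "log"],
    "render": ["log"],
    "tokenize": ["log"],
    "log": [],
}

def build_callers(calls):
    """Invert the call graph: callee -> set of direct callers."""
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)
    return callers

def rank_by_callers(calls):
    """Order functions by number of direct callers, most-called first.

    Ties are broken alphabetically so the result is deterministic.
    """
    callers = build_callers(calls)
    return sorted(calls, key=lambda fn: (-len(callers[fn]), fn))

def impact(calls, target):
    """Return every function that directly or transitively calls `target`.

    Breadth-first search over the inverted graph; these are the functions
    that could be affected if `target` changes.
    """
    callers = build_callers(calls)
    seen, queue = set(), deque([target])
    while queue:
        fn = queue.popleft()
        for caller in callers[fn]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

if __name__ == "__main__":
    print(rank_by_callers(CALLS))
    print(sorted(impact(CALLS, "tokenize")))
```

Once the graph exists, both queries are answered by a single traversal, whereas a file-exploration agent would have to locate every call site by repeated reads and searches.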

Cite

@article{arxiv.2603.27277,
  title  = {Codebase-Memory: Tree-Sitter-Based Knowledge Graphs for LLM Code Exploration via MCP},
  author = {Martin Vogel and Falk Meyer-Eschenbach and Severin Kohler and Elias Grünewald and Felix Balzer},
  journal= {arXiv preprint arXiv:2603.27277},
  year   = {2026}
}

Comments

10 pages, 5 authors, preprint
