Agentic Repository Mining: A Multi-Task Evaluation

Johannes Härtel

Software Engineering 2026-05-07 v1

Abstract

Mining software repositories often requires classifying artifacts like commits, reviews, code lines, or entire repositories into categories. Human labeling is expensive and error-prone; limited context frequently leads to misclassifications or uncertainty in labels. We investigate whether LLM agents that dynamically explore repositories through standard bash commands can match the classification quality of simple LLMs that receive pre-engineered context. Across four tasks, eight approach configurations, and 4943 classifications, agents achieve competitive accuracy despite retrieving their own context. The primary advantage is robustness: agents avoid context-window overflows and scale independently of artifact size. A manual diagnosis of 100 cases where approaches disagree with the ground truth reveals specification ambiguities and labels produced under limited context, suggesting that accuracy against such ground truth may underestimate approaches with broader context access.

Keywords

LLM agents, benchmarking, long-context modeling

Cite

@article{arxiv.2605.04845,
  title  = {Agentic Repository Mining: A Multi-Task Evaluation},
  author = {Johannes Härtel},
  journal= {arXiv preprint arXiv:2605.04845},
  year   = {2026}
}

Comments

Accepted at the 30th International Conference on Evaluation and Assessment in Software Engineering (EASE 2026). 11 pages.
