Topical: Learning Repository Embeddings from Source Code using Attention

Software Engineering 2023-11-07 v4 Artificial Intelligence

Abstract

This paper presents Topical, a novel deep neural network that computes repository-level embeddings. Existing methods rely on natural language documentation or on naive aggregation techniques; Topical instead uses an attention mechanism and outperforms them. The mechanism generates repository-level representations from source code, full dependency graphs, and script-level textual data. Trained on publicly accessible GitHub repositories, Topical surpasses multiple baselines on tasks such as repository auto-tagging, which highlights the efficacy of attention over traditional aggregation methods. Topical also demonstrates scalability and efficiency, making it a valuable contribution to repository-level representation computation. The accompanying tools, code, and training dataset are available for further research at: https://github.com/jpmorganchase/topical.
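
To make the contrast between naive aggregation and attention-based aggregation concrete, the following is a minimal sketch, not the Topical architecture itself. It assumes each script in a repository has already been encoded as a fixed-size vector, and it uses a single hypothetical query vector (which would be learned in practice) with scaled dot-product scoring. The names `mean_pool`, `attention_pool`, and `query` are illustrative and do not come from the paper or its code.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_pool(script_embeddings: List[Vector]) -> Vector:
    """Naive aggregation: every script contributes equally."""
    n = len(script_embeddings)
    dim = len(script_embeddings[0])
    return [sum(vec[i] for vec in script_embeddings) / n for i in range(dim)]


def attention_pool(
    script_embeddings: List[Vector], query: Vector
) -> Tuple[Vector, List[float]]:
    """Attention aggregation: scripts are weighted by similarity to a query.

    `query` is a hypothetical stand-in for a learned parameter; here it is
    passed in by hand purely for illustration.
    """
    dim = len(query)
    # Scaled dot-product score between the query and each script embedding.
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in script_embeddings
    ]
    # Numerically stable softmax over the scores.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of script embeddings gives the repository-level vector.
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, script_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


# Three toy script embeddings; the query favours the first dimension.
scripts = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [4.0, 0.0]

repo_mean = mean_pool(scripts)
repo_attn, weights = attention_pool(scripts, query)
```

In this toy setup, mean pooling lets the two near-duplicate scripts dominate the repository vector, whereas attention pooling shifts weight toward the script that matches the query. A trained model would learn what to attend to rather than receive a hand-picked query.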

Cite

@article{arxiv.2208.09495,
  title  = {Topical: Learning Repository Embeddings from Source Code using Attention},
  author = {Agathe Lherondelle and Varun Babbar and Yash Satsangi and Fran Silavong and Shaltiel Eloul and Sean Moran},
  journal= {arXiv preprint arXiv:2208.09495},
  year   = {2023}
}

Comments

Pre-print, under review
