A Language-Agnostic Model for Semantic Source Code Labeling

Machine Learning | 2019-06-05 | v1 | Computation and Language | Software Engineering | Machine Learning

Abstract

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under the ROC curve of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%. This strongly indicates that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
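As an illustration of the headline metric, the sketch below computes a mean area under the ROC curve across tags for a multi-label classifier. It is not the authors' evaluation code: the abstract does not say how the per-tag values are averaged, so an unweighted mean over tags is assumed here, and the tag names, labels, and scores are hypothetical toy data.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC for one tag, computed as the probability that a random
    positive example is scored above a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(y_true: List[List[int]], y_score: List[List[float]]) -> float:
    """Unweighted mean of per-tag AUCs (an assumed averaging scheme).
    Rows are documents, columns are tags. Tags that lack either a positive
    or a negative example are skipped because their AUC is undefined."""
    n_tags = len(y_true[0])
    aucs = []
    for t in range(n_tags):
        labels = [row[t] for row in y_true]
        scores = [row[t] for row in y_score]
        if 0 < sum(labels) < len(labels):
            aucs.append(roc_auc(labels, scores))
    if not aucs:
        raise ValueError("no tag has both positive and negative examples")
    return sum(aucs) / len(aucs)


# Hypothetical toy data: 4 snippets, 3 tags (say "python", "sql", "regex").
y_true = [[1, 0, 0], [0, 1, 0], [1, 0, 1], [0, 1, 0]]
y_score = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.6, 0.1, 0.7], [0.7, 0.9, 0.2]]

print(round(mean_auc(y_true, y_score), 4))  # per-tag AUCs: 0.75, 1.0, 1.0
```

Averaging per tag rather than pooling all predictions gives rare tags the same weight as common ones, which matters for a long-tailed tag list. The pairwise loop is quadratic in the number of examples, so a rank-based implementation would be preferable at realistic scale.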

Cite

@article{arxiv.1906.01032,
  title   = {A Language-Agnostic Model for Semantic Source Code Labeling},
  author  = {Ben Gelman and Bryan Hoyle and Jessica Moore and Joshua Saxe and David Slater},
  journal = {arXiv preprint arXiv:1906.01032},
  year    = {2019}
}

Comments

MASES 2018 Publication