Algorithmic Programming Language Identification

Machine Learning 2011-11-10 v2

Abstract

Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical features. We also explored, but abandoned, a grammatical approach. In testing, our implementation greatly outperforms an existing tool that relies on a Bayesian classifier. The code is written in Python and is available under an MIT license.
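
To make the idea of supervised learning over statistical features concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical feature set (relative frequencies of a few hand-picked tokens), a nearest-centroid classifier, and a toy training set; the names FEATURE_TOKENS, extract_features, train, and classify are all invented for illustration, and the actual features and learner are described in the paper and the linked repository.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: tokens whose relative frequency tends to
# differ between languages. The paper's real features are not shown here.
FEATURE_TOKENS = ["def", "self", "{", "}", ";", "#include", "->", ":"]


def extract_features(source):
    """Return the relative frequency of each feature token in `source`."""
    # Words (optionally prefixed with '#'), the arrow operator, and a few
    # single-character punctuation marks are treated as tokens.
    tokens = re.findall(r"#?\w+|->|[{};:]", source)
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[token] / total for token in FEATURE_TOKENS]


def train(samples):
    """Build a label -> mean feature vector map from (label, source) pairs."""
    grouped = {}
    for label, source in samples:
        grouped.setdefault(label, []).append(extract_features(source))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def classify(model, source):
    """Return the label whose centroid is closest (Euclidean) to `source`."""
    features = extract_features(source)
    return min(model, key=lambda label: math.dist(features, model[label]))


# Toy labeled data, made up for this sketch.
TRAINING = [
    ("python", "def add(self, x):\n    return self.total + x\n"),
    ("python", "class A:\n    def run(self):\n        pass\n"),
    ("c", '#include <stdio.h>\nint main() { printf("hi"); return 0; }\n'),
    ("c", "struct node { int v; struct node *next; };\n"
          "int f(struct node *n) { return n->v; }\n"),
]

model = train(TRAINING)
print(classify(model, "def greet(self):\n    print(self.name)\n"))
print(classify(model, "#include <math.h>\nint sq(int x) { return x * x; }\n"))
```

A nearest-centroid rule is used here only because it fits in a few lines of standard-library Python; any supervised learner could consume the same feature vectors, and a realistic system would need far more training data and richer features.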

Cite

@article{arxiv.1106.4064,
  title  = {Algorithmic Programming Language Identification},
  author = {David Klein and Kyle Murray and Simon Weber},
  journal= {arXiv preprint arXiv:1106.4064},
  year   = {2011}
}

Comments

11 pages. Code: https://github.com/simon-weber/Programming-Language-Identification
