Towards Understanding What Code Language Models Learned

Toufique Ahmed; Dian Yu; Chengxuan Huang; Cathy Wang; Prem Devanbu; Kenji Sagae

Software Engineering, Computation and Language, Machine Learning; 2024-02-29 (v2)

Abstract

Pre-trained language models are effective in a variety of natural language tasks, but it has been argued that their capabilities fall short of fully learning meaning or understanding language. To understand the extent to which language models can learn some form of meaning, we investigate their ability to capture the semantics of code beyond superficial frequency and co-occurrence. In contrast to previous research on probing models for linguistic features, we study pre-trained models in a setting that allows for objective and straightforward evaluation of a model's ability to learn semantics. In this paper, we examine whether such models capture the semantics of code, which is precisely and formally defined. Through experiments involving the manipulation of code fragments, we show that pre-trained models of code learn a robust representation of the computational semantics of code that goes beyond superficial features of form alone.

Keywords

language modeling; pre-trained language model; natural language parsing

Cite

@article{arxiv.2306.11943,
  title  = {Towards Understanding What Code Language Models Learned},
  author = {Toufique Ahmed and Dian Yu and Chengxuan Huang and Cathy Wang and Prem Devanbu and Kenji Sagae},
  journal= {arXiv preprint arXiv:2306.11943},
  year   = {2024}
}
