Related papers: Semantic Source Code Models Using Identifier Embed…

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques…

Software Engineering · Computer Science 2019-03-15 Rafael-Michael Karampatsis, Charles Sutton

On the Effect of Semantically Enriched Context Models on Software Modularization

Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of the system that relies…

Software Engineering · Computer Science 2017-08-08 Amir Saeidi, Jurriaan Hage, Ravi Khadka, Slinger Jansen

IdBench: Evaluating Semantic Representations of Identifier Names in Source Code

Identifier names convey useful information about the intended semantics of code. Name-based program analyses use this information, e.g., to detect bugs, to predict types, and to improve the readability of code. At the core of name-based…

Machine Learning · Computer Science 2021-01-15 Yaza Wainakh, Moiz Rauf, Michael Pradel

A Literature Study of Embeddings on Source Code

Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and…

Machine Learning · Computer Science 2019-04-08 Zimin Chen, Martin Monperrus

Learning Semantic Vector Representations of Source Code via a Siamese Neural Network

The abundance of open-source code, coupled with the success of recent advances in deep learning for natural language processing, has given rise to a promising new application of machine learning to source code. In this work, we explore the…

Machine Learning · Computer Science 2019-04-29 David Wehr, Halley Fede, Eleanor Pence, Bo Zhang, Guilherme Ferreira, John Walczyk, Joseph Hughes

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

Semantic code search is the task of retrieving relevant code snippet given a natural language query. Different from typical information retrieval tasks, code search requires to bridge the semantic gap between the programming language and…

Computation and Language · Computer Science 2022-01-28 Chen Wu, Ming Yan

Towards Demystifying Dimensions of Source Code Embeddings

Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with…

Machine Learning · Computer Science 2022-06-17 Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

On the Generation, Structure, and Semantics of Grammar Patterns in Source Code Identifiers

Identifiers make up a majority of the text in code. They are one of the most basic mediums through which developers describe the code they create and understand the code that others create. Therefore, understanding the patterns latent in…

Software Engineering · Computer Science 2020-07-17 Christian D. Newman, Reem S. AlSuhaibani, Michael J. Decker, Anthony Peruma, Dishant Kaushik, Mohamed Wiem Mkaouer, Emily Hill

code2vec: Learning Distributed Representations of Code

We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length $\textit{code vector}$, which can be used to predict…

Machine Learning · Computer Science 2018-10-31 Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Millions of repetitive code snippets are submitted to code repositories every day. To search from these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop easier and faster.…

Software Engineering · Computer Science 2021-04-19 Abdullah Al Ishtiaq, Masum Hasan, Md. Mahim Anjum Haque, Kazi Sajeed Mehrab, Tanveer Muttaqueen, Tahmid Hasan, Anindya Iqbal, Rifat Shahriyar

Semantic Code Graph -- an information model to facilitate software comprehension

Software comprehension can be extremely time-consuming due to the ever-growing size of codebases. Consequently, there is an increasing need to accelerate the code comprehension process to facilitate maintenance and reduce associated costs.…

Software Engineering · Computer Science 2024-01-15 Krzysztof Borowski, Bartosz Baliś, Tomasz Orzechowski

Neural Code Comprehension: A Learnable Representation of Code Semantics

With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation,…

Machine Learning · Computer Science 2018-11-30 Tal Ben-Nun, Alice Shoshana Jakobovits, Torsten Hoefler

A Language-Agnostic Model for Semantic Source Code Labeling

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new…

Machine Learning · Computer Science 2019-06-05 Ben Gelman, Bryan Hoyle, Jessica Moore, Joshua Saxe, David Slater

Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…

Software Engineering · Computer Science 2020-08-19 Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Import2vec - Learning Embeddings for Software Libraries

We consider the problem of developing suitable learning representations (embeddings) for library packages that capture semantic similarity among libraries. Such representations are known to improve the performance of downstream learning…

Software Engineering · Computer Science 2019-04-09 Bart Theeten, Frederik Vandeputte, Tom Van Cutsem

Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code

Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major…

Software Engineering · Computer Science 2020-03-19 Rafael-Michael Karampatsis, Hlib Babii, Romain Robbes, Charles Sutton, Andrea Janes

Text-to-Code Generation with Modality-relative Pre-training

Large pre-trained language models have recently been expanded and applied to programming language tasks with great success, often through further pre-training of a strictly-natural language model--where training sequences typically contain…

Computation and Language · Computer Science 2024-02-13 Fenia Christopoulou, Guchun Zhang, Gerasimos Lampouras

How are identifiers named in open source software? About popularity and consistency

With the rapid increasing of software project size and maintenance cost, adherence to coding standards especially by managing identifier naming, is attracting a pressing concern from both computer science educators and software managers.…

Software Engineering · Computer Science 2014-06-03 Yanqing Wang, Chong Wang, Xiaojie Li, Sijing Yun, Minjing Song

Commit2Vec: Learning Distributed Representations of Code Changes

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available…

Software Engineering · Computer Science 2021-11-18 Rocío Cabrera Lozoya, Arnaud Baumann, Antonino Sabetta, Michele Bezzi