Related papers: Embedding Java Classes with code2vec: Improvements…

Malware Classification with Word Embedding Features

Malware classification is an important and challenging problem in information security. Modern malware classification techniques rely on machine learning models that can be trained on features such as opcode sequences, API calls, and byte…

Cryptography and Security · Computer Science 2021-03-05 Aparna Sunil Kale, Fabio Di Troia, Mark Stamp

code2vec: Learning Distributed Representations of Code

We present a neural model for representing snippets of code as continuous distributed vectors ("code embeddings"). The main idea is to represent a code snippet as a single fixed-length code vector, which can be used to predict…

Machine Learning · Computer Science 2018-10-31 Uri Alon, Meital Zilberstein, Omer Levy, Eran Yahav

Bug Prediction Using Source Code Embedding Based on Doc2Vec

Bug prediction is a resource demanding task that is hard to automate using static source code analysis. In many fields of computer science, machine learning has proven to be extremely useful in tasks like this, however, for it to work we…

Software Engineering · Computer Science 2021-10-12 Tamás Aladics, Judit Jász, Rudolf Ferenc

Towards Demystifying Dimensions of Source Code Embeddings

Source code representations are key in applying machine learning techniques for processing and analyzing programs. A popular approach in representing source code is neural source code embeddings that represents programs with…

Machine Learning · Computer Science 2022-06-17 Md Rafiqul Islam Rabin, Arjun Mukherjee, Omprakash Gnawali, Mohammad Amin Alipour

Using Distributed Representation of Code for Bug Detection

Recent advances in neural modeling for bug detection have been very promising. More specifically, using snippets of code to create continuous vectors or embeddings has been shown to be very good at method name prediction and…

Software Engineering · Computer Science 2020-05-14 Jón Arnar Briem, Jordi Smit, Hendrig Sellik, Pavel Rapoport

A Comparison of Word2Vec, HMM2Vec, and PCA2Vec for Malware Classification

Word embeddings are often used in natural language processing as a means to quantify relationships between words. More generally, these same word embedding techniques can be used to quantify relationships between features. In this paper, we…

Cryptography and Security · Computer Science 2021-03-11 Aniket Chandak, Wendy Lee, Mark Stamp

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

A Controlled Experiment of Different Code Representations for Learning-Based Bug Repair

Training a deep learning model on source code has gained significant traction recently. Since such models reason about vectors of numbers, source code needs to be converted to a code representation before vectorization. Numerous approaches…

Software Engineering · Computer Science 2022-07-18 Marjane Namavar, Noor Nashid, Ali Mesbah

JConstHide: A Framework for Java Source Code Constant Hiding

Software obfuscation or obscuring a software is an approach to defeat the practice of reverse engineering a software for using its functionality illegally in the development of another software. Java applications are more amenable to…

Cryptography and Security · Computer Science 2009-04-23 Praveen Sivadasan, P Sojan Lal

A Literature Study of Embeddings on Source Code

Natural language processing has improved tremendously after the success of word embedding techniques such as word2vec. Recently, the same idea has been applied on source code with encouraging results. In this survey, we aim to collect and…

Machine Learning · Computer Science 2019-04-08 Zimin Chen, Martin Monperrus

GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses

Code embedding is a keystone in the application of machine learning on several Software Engineering (SE) tasks. To effectively support a plethora of SE tasks, the embedding needs to capture program syntax and semantics in a way that is…

Software Engineering · Computer Science 2022-01-24 Wei Ma, Mengjie Zhao, Ezekiel Soremekun, Qiang Hu, Jie Zhang, Mike Papadakis, Maxime Cordy, Xiaofei Xie, Yves Le Traon

Commit2Vec: Learning Distributed Representations of Code Changes

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available…

Software Engineering · Computer Science 2021-11-18 Rocío Cabrera Lozoya, Arnaud Baumann, Antonino Sabetta, Michele Bezzi

Op2Vec: An Opcode Embedding Technique and Dataset Design for End-to-End Detection of Android Malware

Android is one of the leading operating systems for smart phones in terms of market share and usage. Unfortunately, it is also an appealing target for attackers to compromise its security through malicious applications. To tackle this…

Cryptography and Security · Computer Science 2022-05-31 Kaleem Nawaz Khan, Najeeb Ullah, Sikandar Ali, Muhammad Salman Khan, Mohammad Nauman, Anwar Ghani

Recommendation of Move Method Refactoring Using Path-Based Representation of Code

Software refactoring plays an important role in increasing code quality. One of the most popular refactoring types is the Move Method refactoring. It is usually applied when a method depends more on members of other classes than on its own…

Software Engineering · Computer Science 2020-02-18 Zarina Kurbatova, Ivan Veselov, Yaroslav Golubev, Timofey Bryksin

Hierarchical Learning of Cross-Language Mappings through Distributed Vector Representations for Code

Translating a program written in one programming language to another can be useful for software development tasks that need functionality implementations in different languages. Although past studies have considered this problem, they may…

Machine Learning · Computer Science 2018-03-14 Nghi D. Q. Bui, Lingxiao Jiang

Optimizing Code Embeddings and ML Classifiers for Python Source Code Vulnerability Detection

In recent years, the growing complexity and scale of source code have rendered manual software vulnerability detection increasingly impractical. To address this challenge, automated approaches leveraging machine learning and code embeddings…

Software Engineering · Computer Science 2025-09-17 Talaya Farasat, Joachim Posegga

Detecting Security Fixes in Open-Source Repositories using Static Code Analyzers

The sources of reliable, code-level information about vulnerabilities that affect open-source software (OSS) are scarce, which hinders a broad adoption of advanced tools that provide code-level detection and assessment of vulnerable OSS…

Software Engineering · Computer Science 2021-05-10 Therese Fehrer, Rocío Cabrera Lozoya, Antonino Sabetta, Dario Di Nucci, Damian A. Tamburri

On the Embeddings of Variables in Recurrent Neural Networks for Source Code

Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics…

Software Engineering · Computer Science 2021-04-28 Nadezhda Chirkova

Obfuscating Java Programs by Translating Selected Portions of Bytecode to Native Libraries

Code obfuscation is a popular approach to turn program comprehension and analysis harder, with the aim of mitigating threats related to malicious reverse engineering and code tampering. However, programming languages that compile to high…

Software Engineering · Computer Science 2019-01-16 Davide Pizzolotto, Mariano Ceccato

Variables are a Curse in Software Vulnerability Prediction

Deep learning-based approaches for software vulnerability prediction currently mainly rely on the original text of software code as the feature of nodes in the graph of code and thus could learn a representation that is only specific to the…

Software Engineering · Computer Science 2024-07-04 Jinghua Groppe, Sven Groppe, Ralf Möller