VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

Hazim Hanif; Sergio Maffeis

doi:10.1109/IJCNN55064.2022.9892280

Subjects: Cryptography and Security; Artificial Intelligence; Machine Learning. Version: v1 (2023-06-21)

Abstract

This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep knowledge representation of code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The evaluation results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across different datasets, despite its conceptual simplicity and its limited cost in terms of training data size and number of model parameters.
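
To make the pre-training idea concrete, the following is a minimal sketch of the two steps the abstract names: tokenising C code and preparing it for RoBERTa-style masked-token prediction. It is not the authors' pipeline. The regex pattern, the function names `tokenize_c` and `mask_tokens`, the `<mask>` string, and the 15% masking rate are all assumptions chosen for illustration, and the masking is simplified (every selected token is replaced by the mask token).

```python
import random
import re

# Hypothetical token pattern (an assumption, not the paper's tokeniser):
# identifiers, integer literals, string literals, a few two-character
# operators, then any other single non-space character.
TOKEN_RE = re.compile(
    r'[A-Za-z_]\w*'                     # identifiers and keywords
    r'|\d+'                             # integer literals
    r'|"(?:\\.|[^"\\])*"'               # double-quoted string literals
    r'|->|\+\+|--|==|!=|<=|>=|&&|\|\|'  # common two-character operators
    r'|\S'                              # any other single symbol
)


def tokenize_c(code):
    """Split a C snippet into coarse lexical tokens, dropping whitespace."""
    return TOKEN_RE.findall(code)


def mask_tokens(tokens, mask_prob=0.15, seed=0, mask_token="<mask>"):
    """Replace a random subset of tokens with mask_token.

    Returns (masked, labels): labels holds the original token at each
    masked position and None elsewhere, so a model could be trained to
    predict the hidden tokens from their context.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


snippet = 'char buf[8]; strcpy(buf, src);'
tokens = tokenize_c(snippet)
masked, labels = mask_tokens(tokens)
```

In a real setup the token sequence would be mapped to subword IDs and fed to a transformer encoder; the sketch only shows how masked inputs and their prediction targets line up position by position.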

Keywords

vulnerability detection; binary analysis; verification

Cite

@article{arxiv.2205.12424,
  title  = {VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection},
  author = {Hazim Hanif and Sergio Maffeis},
  journal= {arXiv preprint arXiv:2205.12424},
  year   = {2023}
}

Comments

Accepted as a conference paper at IJCNN 2022
