VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

Subjects: Cryptography and Security; Artificial Intelligence; Machine Learning
Date: 2023-06-21 (v1)

Abstract

This paper presents VulBERTa, a deep learning approach to detecting security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The model learns a deep representation of code syntax and semantics, which we leverage to train vulnerability detection classifiers. We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets (Vuldeepecker, Draper, REVEAL and muVuldeepecker) and benchmarks (CodeXGLUE and D2A). The results show that VulBERTa achieves state-of-the-art performance and outperforms existing approaches across these datasets, despite its conceptual simplicity and its limited cost in terms of training data size and number of model parameters.

Cite

@article{arxiv.2205.12424,
  title  = {VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection},
  author = {Hazim Hanif and Sergio Maffeis},
  journal= {arXiv preprint arXiv:2205.12424},
  year   = {2023}
}

Comments

Accepted as a conference paper at IJCNN 2022
