Related papers: DeepSCC: Source Code Classification Based on Fine-…

SCC: Automatic Classification of Code Snippets

Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the…

Software Engineering · Computer Science 2018-09-24 Kamel Alreshedy , Dhanush Dharmaretnam , Daniel M. German , Venkatesh Srinivasan , T. Aaron Gulliver

Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…

Software Engineering · Computer Science 2020-08-19 Aditya Kanade , Petros Maniatis , Gogul Balakrishnan , Kensen Shi

A Language-Agnostic Model for Semantic Source Code Labeling

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new…

Machine Learning · Computer Science 2019-06-05 Ben Gelman , Bryan Hoyle , Jessica Moore , Joshua Saxe , David Slater

Machine Learning Based Source Code Classification Using Syntax Oriented Features

As of today the programming language of the vast majority of the published source code is manually specified or programmatically assigned based on the sole file extension. In this paper we show that the source code programming language…

Machine Learning · Computer Science 2017-03-23 Shaul Zevin , Catherine Holzem

Enhancing Source Code Classification Effectiveness via Prompt Learning Incorporating Knowledge Features

Researchers have investigated the potential of leveraging pre-trained language models, such as CodeBERT, to enhance source code-related tasks. Previous methodologies have relied on CodeBERT's '[CLS]' token as the embedding representation of…

Computation and Language · Computer Science 2024-09-04 Yong Ma , Senlin Luo , Yu-Ming Shang , Yifei Zhang , Zhengjun Li

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques…

Software Engineering · Computer Science 2019-03-15 Rafael-Michael Karampatsis , Charles Sutton

Feature Engineering-Based Detection of Buffer Overflow Vulnerability in Source Code Using Neural Networks

One of the most significant challenges in the field of software code auditing is the presence of vulnerabilities in software source code. Every year, more and more software flaws are discovered, either internally in proprietary code or…

Cryptography and Security · Computer Science 2023-06-16 Mst Shapna Akter , Hossain Shahriar , Juan Rodriguez Cardenas , Sheikh Iqbal Ahamed , Alfredo Cuzzocrea

CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow

We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, code snippets associated, and an average of three related unit tests. It…

Computation and Language · Computer Science 2024-09-26 Nathanaël Beau , Benoît Crabbé

Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language Processing

Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly…

Software Engineering · Computer Science 2018-09-24 Kamel Alreshedy , Dhanush Dharmaretnam , Daniel M. German , Venkatesh Srinivasan , T. Aaron Gulliver

Benchmarking Large Language Models for Log Analysis, Security, and Interpretation

Large Language Models (LLM) continue to demonstrate their utility in a variety of emergent capabilities in different fields. An area that could benefit from effective language understanding in cybersecurity is the analysis of log files.…

Networking and Internet Architecture · Computer Science 2023-11-27 Egil Karlsen , Xiao Luo , Nur Zincir-Heywood , Malcolm Heywood

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian , Chenhui Cui , Rubing Huang , Chunrong Fang , Zhenyu Chen

Recommending Code Improvements Based on Stack Overflow Answer Edits

Background: Sub-optimal code is prevalent in software systems. Developers may write low-quality code due to many reasons, such as lack of technical knowledge, lack of experience, time pressure, management decisions, and even unhappiness.…

Software Engineering · Computer Science 2022-04-15 Chaiyong Ragkhitwetsagul , Matheus Paixao

Optimizing Deep Learning Models to Address Class Imbalance in Code Comment Classification

Developers rely on code comments to document their work, track issues, and understand the source code. As such, comments provide valuable insights into developers' understanding of their code and describe their various intentions in writing…

Software Engineering · Computer Science 2025-07-03 Moritz Mock , Thomas Borsani , Giuseppe Di Fatta , Barbara Russo

Logical Segmentation of Source Code

Many software analysis methods have come to rely on machine learning approaches. Code segmentation - the process of decomposing source code into meaningful blocks - can augment these methods by featurizing code, reducing noise, and limiting…

Software Engineering · Computer Science 2019-07-23 Jacob Dormuth , Ben Gelman , Jessica Moore , David Slater

VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection

This paper presents VulBERTa, a deep learning approach to detect security vulnerabilities in source code. Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects. The…

Cryptography and Security · Computer Science 2023-06-21 Hazim Hanif , Sergio Maffeis

Semantic Source Code Segmentation using Small and Large Language Models

Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and…

Software Engineering · Computer Science 2025-07-15 Abdelhalim Dahou , Ansgar Scherp , Sebastian Kurten , Brigitte Mathiak , Madhu Chauhan

Cross-Language Source Code Clone Detection Using Deep Learning with InferCode

Software clones are beneficial to detect security gaps and software maintenance in one programming language or across multiple languages. The existing work on source clone detection performs well but in a single programming language.…

Software Engineering · Computer Science 2022-05-11 Mohammad A. Yahya , Dae-Kyoo Kim

Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection

Modern software relies on a multitude of automated testing and quality assurance tools to prevent errors, bugs and potential vulnerabilities. This study sets out to provide a head-to-head, quantitative and qualitative evaluation of six…

Software Engineering · Computer Science 2025-08-07 Damian Gnieciak , Tomasz Szandala

Secret Breach Detection in Source Code with Large Language Models

Background: Leaking sensitive information - such as API keys, tokens, and credentials - in source code remains a persistent security threat. Traditional regex and entropy-based tools often generate high false positives due to limited…

Software Engineering · Computer Science 2025-07-29 Md Nafiu Rahman , Sadif Ahmed , Zahin Wahab , S M Sohan , Rifat Shahriyar

Using StackOverflow content to assist in code review

An important goal for programmers is to minimize cost of identifying and correcting defects in source code. Code review is commonly used for identifying programming defects. However, manual code review has some shortcomings: a) it is time…

Software Engineering · Computer Science 2018-09-13 Balwinder Sodhi , Shipra Sharma