Related papers: Machine Learning Based Source Code Classification …

SCC: Automatic Classification of Code Snippets

Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the…

Software Engineering · Computer Science 2018-09-24 Kamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German, Venkatesh Srinivasan, T. Aaron Gulliver

Algorithmic Programming Language Identification

Motivated by the amount of code that goes unidentified on the web, we introduce a practical method for algorithmically identifying the programming language of source code. Our work is based on supervised learning and intelligent statistical…

Machine Learning · Computer Science 2011-11-10 David Klein, Kyle Murray, Simon Weber

Identifying Source Code File Experts

In software development, the identification of source code file experts is an important task. Identifying these experts helps to improve software maintenance and evolution activities, such as developing new features, code reviews, and bug…

Software Engineering · Computer Science 2022-08-17 Otávio Cury, Guilherme Avelino, Pedro Santos Neto, Ricardo Britto, Marco Túlio Valente

Content-Based Textual File Type Detection at Scale

Programming language detection is a common need in the analysis of large source code bases. It is supported by a number of existing tools that rely on several features, and most notably file extensions, to determine file types. We consider…

Software Engineering · Computer Science 2021-03-02 Francesca Del Bonifro, Maurizio Gabbrielli, Stefano Zacchiroli

Natural Language-Guided Programming

In today's software world with its cornucopia of reusable software libraries, when a programmer is faced with a programming task that they suspect can be completed through the use of a library, they often look for code examples using a…

Software Engineering · Computer Science 2021-10-08 Geert Heyman, Rafael Huysegems, Pascal Justen, Tom Van Cutsem

Source Code Recommender Systems: The Practitioners' Perspective

The automatic generation of source code is one of the long-lasting dreams in software engineering research. Several techniques have been proposed to speed up the writing of new code. For example, code completion techniques can recommend to…

Software Engineering · Computer Science 2023-02-09 Matteo Ciniselli, Luca Pascarella, Emad Aghajani, Simone Scalabrino, Rocco Oliveto, Gabriele Bavota

A Survey on Machine Learning Techniques for Source Code Analysis

The advancements in machine learning techniques have encouraged researchers to apply these techniques to a myriad of software engineering tasks that use source code analysis, such as testing and vulnerability detection. Such a large number…

Software Engineering · Computer Science 2022-09-14 Tushar Sharma, Maria Kechagia, Stefanos Georgiou, Rohit Tiwari, Indira Vats, Hadi Moazen, Federica Sarro

Logical Segmentation of Source Code

Many software analysis methods have come to rely on machine learning approaches. Code segmentation - the process of decomposing source code into meaningful blocks - can augment these methods by featurizing code, reducing noise, and limiting…

Software Engineering · Computer Science 2019-07-23 Jacob Dormuth, Ben Gelman, Jessica Moore, David Slater

An Effective Approach to Embedding Source Code by Combining Large Language and Sentence Embedding Models

The advent of large language models (LLMs) has significantly advanced artificial intelligence (AI) in software engineering (SE), with source code embeddings playing a crucial role in tasks such as source code clone detection and source code…

Software Engineering · Computer Science 2025-06-04 Zixiang Xian, Chenhui Cui, Rubing Huang, Chunrong Fang, Zhenyu Chen

LLM-Aided Customizable Profiling of Code Data Based On Programming Language Concepts

Data profiling is critical in machine learning for generating descriptive statistics, supporting both deeper understanding and downstream tasks like data valuation and curation. This work addresses profiling specifically in the context of…

Software Engineering · Computer Science 2025-03-21 Pankaj Thorat, Adnan Qidwai, Adrija Dhar, Aishwariya Chakraborty, Anand Eswaran, Hima Patel, Praveen Jayachandran

An Empirical Study on Automatically Detecting AI-Generated Source Code: How Far Are We?

Artificial Intelligence (AI) techniques, especially Large Language Models (LLMs), have started gaining popularity among researchers and software developers for generating source code. However, LLMs have been shown to generate code with…

Software Engineering · Computer Science 2024-11-08 Hyunjae Suh, Mahan Tafreshipour, Jiawei Li, Adithya Bhattiprolu, Iftekhar Ahmed

DeepSCC: Source Code Classification Based on Fine-Tuned RoBERTa

In software engineering-related tasks (such as programming language tag prediction based on code snippets from Stack Overflow), the programming language classification for code snippets is a common task. In this study, we propose a novel…

Software Engineering · Computer Science 2021-10-05 Guang Yang, Yanlin Zhou, Chi Yu, Xiang Chen

Exploring Software Naturalness through Neural Language Models

The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing. We explore this hypothesis through the use of a pre-trained transformer-based language…

Computation and Language · Computer Science 2020-06-25 Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, Giacomo Domeniconi

A Comparative Study of Different Source Code Metrics and Machine Learning Algorithms for Predicting Change Proneness of Object Oriented Systems

Change-prone classes or modules are defined as software components in the source code which are likely to change in the future. Change-proneness prediction is useful to the maintenance team as they can optimize and focus their testing…

Software Engineering · Computer Science 2017-12-22 Lov Kumar, Ashish Sureka

Automatic Classification of Object Code Using Machine Learning

Recent research has repeatedly shown that machine learning techniques can be applied to either whole files or file fragments to classify them for analysis. We build upon these techniques to show that for samples of un-labeled compiled…

Machine Learning · Statistics 2018-05-08 John Clemens

Enhancing Source Code Classification Effectiveness via Prompt Learning Incorporating Knowledge Features

Researchers have investigated the potential of leveraging pre-trained language models, such as CodeBERT, to enhance source code-related tasks. Previous methodologies have relied on CodeBERT's '[CLS]' token as the embedding representation of…

Computation and Language · Computer Science 2024-09-04 Yong Ma, Senlin Luo, Yu-Ming Shang, Yifei Zhang, Zhengjun Li

Automatic Labeling of the Object-oriented Source Code: The Lotus Approach

Most open-source software systems are available on the internet today. Thus, we need automatic methods to label software code. Software code can be labeled with a set of keywords. In this paper, these keywords are referred to as software…

Software Engineering · Computer Science 2018-03-02 Ra'Fat Al-Msie'deen

Semantic Code Classification for Automated Machine Learning

A range of applications for automatic machine learning need the generation process to be controllable. In this work, we propose a way to control the output via a sequence of simple actions that are called semantic code classes. Finally, we…

Machine Learning · Computer Science 2022-01-28 Polina Guseva, Anastasia Drozdova, Natalia Denisenko, Daria Sapozhnikova, Ivan Pyaternev, Anna Scherbakova, Andrey Ustuzhanin

A Survey of Automatic Generation of Source Code Comments: Algorithms and Techniques

As an integral part of source code files, code comments help improve program readability and comprehension. However, developers sometimes do not comment on their program code adequately due to the incurred extra efforts, lack of relevant…

Software Engineering · Computer Science 2019-07-31 Xiaotao Song, Hailong Sun, Xu Wang, Jiafei Yan

Using Software Categories for the Development of Generative Software

In model-driven development (MDD) software emerges by systematically transforming abstract models to concrete source code. Ideally, performing those transformations is to a large extent the task of code generators. One approach for…

Software Engineering · Computer Science 2015-09-09 Pedram Mir Seyed Nazari, Bernhard Rumpe