Related papers: A Language-Agnostic Model for Semantic Source Code…

SCC: Automatic Classification of Code Snippets

Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the…

Software Engineering · Computer Science 2018-09-24 Kamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German, Venkatesh Srinivasan, T. Aaron Gulliver

DeepSCC: Source Code Classification Based on Fine-Tuned RoBERTa

In software engineering-related tasks (such as programming language tag prediction based on code snippets from Stack Overflow), the programming language classification for code snippets is a common task. In this study, we propose a novel…

Software Engineering · Computer Science 2021-10-05 Guang Yang, Yanlin Zhou, Chi Yu, Xiang Chen

Using StackOverflow content to assist in code review

An important goal for programmers is to minimize the cost of identifying and correcting defects in source code. Code review is commonly used for identifying programming defects. However, manual code review has some shortcomings: a) it is time…

Software Engineering · Computer Science 2018-09-13 Balwinder Sodhi, Shipra Sharma

Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language Processing

Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly…

Software Engineering · Computer Science 2018-09-24 Kamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German, Venkatesh Srinivasan, T. Aaron Gulliver

Content-Based Textual File Type Detection at Scale

Programming language detection is a common need in the analysis of large source code bases. It is supported by a number of existing tools that rely on several features, and most notably file extensions, to determine file types. We consider…

Software Engineering · Computer Science 2021-03-02 Francesca Del Bonifro, Maurizio Gabbrielli, Stefano Zacchiroli

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques…

Software Engineering · Computer Science 2019-03-15 Rafael-Michael Karampatsis, Charles Sutton

Learning Deep Semantic Model for Code Search using CodeSearchNet Corpus

Semantic code search is the task of retrieving a relevant code snippet given a natural language query. Unlike typical information retrieval tasks, code search requires bridging the semantic gap between the programming language and…

Computation and Language · Computer Science 2022-01-28 Chen Wu, Ming Yan

A Convolutional Neural Network for Language-Agnostic Source Code Summarization

Descriptive comments play a crucial role in the software engineering process. They decrease development time, enable better bug detection, and facilitate the reuse of previously written code. However, comments are commonly the last of a…

Computation and Language · Computer Science 2019-04-02 Jessica Moore, Ben Gelman, David Slater

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We…

Computer Vision and Pattern Recognition · Computer Science 2022-07-20 Shiyu Zhao, Zhixing Zhang, Samuel Schulter, Long Zhao, Vijay Kumar B. G, Anastasis Stathopoulos, Manmohan Chandraker, Dimitris Metaxas

Machine Learning Based Source Code Classification Using Syntax Oriented Features

As of today, the programming language of the vast majority of published source code is specified manually or assigned programmatically based solely on the file extension. In this paper we show that the source code programming language…

Machine Learning · Computer Science 2017-03-23 Shaul Zevin, Catherine Holzem

Semantic Source Code Models Using Identifier Embeddings

The emergence of online open source repositories in recent years has led to an explosion in the volume of openly available source code, coupled with metadata that relate to a variety of software development activities. As an effect, in…

Software Engineering · Computer Science 2023-12-05 Vasiliki Efstathiou, Diomidis Spinellis

Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection

Modern software relies on a multitude of automated testing and quality assurance tools to prevent errors, bugs and potential vulnerabilities. This study sets out to provide a head-to-head, quantitative and qualitative evaluation of six…

Software Engineering · Computer Science 2025-08-07 Damian Gnieciak, Tomasz Szandala

Unsupervised Model Adaptation for Continual Semantic Segmentation

We develop an algorithm for adapting a semantic segmentation model that is trained using a labeled source domain to generalize well in an unlabeled target domain. A similar problem has been studied extensively in the unsupervised domain…

Machine Learning · Computer Science 2021-01-12 Serban Stan, Mohammad Rostami

Automatic Labeling of the Object-oriented Source Code: The Lotus Approach

Most open-source software systems are available on the internet today. Thus, we need automatic methods to label software code. Software code can be labeled with a set of keywords. These keywords are referred to in this paper as software…

Software Engineering · Computer Science 2018-03-02 Ra'Fat Al-Msie'deen

Semantic Code Graph -- an information model to facilitate software comprehension

Software comprehension can be extremely time-consuming due to the ever-growing size of codebases. Consequently, there is an increasing need to accelerate the code comprehension process to facilitate maintenance and reduce associated costs.…

Software Engineering · Computer Science 2024-01-15 Krzysztof Borowski, Bartosz Baliś, Tomasz Orzechowski

Precise Learning of Source Code Contextual Semantics via Hierarchical Dependence Structure and Graph Attention Networks

Deep learning is being used extensively in a variety of software engineering tasks, e.g., program classification and defect prediction. Although the technique eliminates the required process of feature engineering, the construction of…

Software Engineering · Computer Science 2021-11-24 Zhehao Zhao, Bo Yang, Ge Li, Huai Liu, Zhi Jin

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with…

Computation and Language · Computer Science 2018-05-24 Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, Graham Neubig

Label Smoothing Improves Neural Source Code Summarization

Label smoothing is a regularization technique for neural networks. Normally neural models are trained to an output distribution that is a vector with a single 1 for the correct prediction, and 0 for all other elements. Label smoothing…

Software Engineering · Computer Science 2023-03-29 Sakib Haque, Aakash Bansal, Collin McMillan

Label-semantics Aware Generative Approach for Domain-Agnostic Multilabel Classification

The explosion of textual data has made manual document classification increasingly challenging. To address this, we introduce a robust, efficient domain-agnostic generative model framework for multi-label text classification. Instead of…

Computation and Language · Computer Science 2025-07-22 Subhendu Khatuya, Shashwat Naidu, Saptarshi Ghosh, Pawan Goyal, Niloy Ganguly

Skill over Scale: The Case for Medium, Domain-Specific Models for SE

Recent advancements in AI have sparked a trend in constructing large, generalist language models that handle a multitude of tasks, including many code-related ones. While these models are expensive to train and are often closed-source, they…

Computation and Language · Computer Science 2025-02-24 Manisha Mukherjee, Vincent J. Hellendoorn