Related papers: Improving Source Code Similarity Detection Through…

Augmenting the Interpretability of GraphCodeBERT for Code Similarity Tasks

Assessing the degree of similarity of code fragments is crucial for ensuring software quality, but it remains challenging due to the need to capture the deeper semantic aspects of code. Traditional syntactic methods often fail to identify…

Information Retrieval · Computer Science 2025-04-14 Jorge Martinez-Gil

Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures

The capability of accurately determining code similarity is crucial in many tasks related to software development. For example, it might be essential to identify code duplicates for performing software maintenance. This research introduces…

Software Engineering · Computer Science 2025-04-25 Jorge Martinez-Gil

GraphCodeBERT: Pre-training Code Representations with Data Flow

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code…

Software Engineering · Computer Science 2021-09-14 Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, Ming Zhou

Generalizability of Code Clone Detection on CodeBERT

Transformer networks such as CodeBERT already achieve outstanding results for code clone detection in benchmark datasets, so one could assume that this task has already been solved. However, code clone detection is not a trivial task.…

Software Engineering · Computer Science 2022-09-02 Tim Sonnekalb, Bernd Gruner, Clemens-Alexander Brust, Patrick Mäder

Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection

Source code clone detection is the task of finding code fragments that have the same or similar functionality, but may differ in syntax or structure. This task is important for software maintenance, reuse, and quality assurance (Roy et al.…

Computation and Language · Computer Science 2023-12-29 Mohammed Ataaur Rahaman, Julia Ive

GN-Transformer: Fusing Sequence and Graph Representation for Improved Code Summarization

As opposed to natural languages, source code understanding is influenced by grammatical relationships between tokens regardless of their identifier name. Graph representations of source code such as Abstract Syntax Tree (AST) can capture…

Machine Learning · Computer Science 2021-11-18 Junyan Cheng, Iordanis Fostiropoulos, Barry Boehm

Scalable Source Code Similarity Detection in Large Code Repositories

Source code similarity is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic because errors in the original code must…

Software Engineering · Computer Science 2019-07-30 F Alomari, M Harbi

Gitor: Scalable Code Clone Detection by Building Global Sample Graph

Code clone detection is about finding out similar code fragments, which has drawn much attention in software engineering since it is important for software maintenance and evolution. Researchers have proposed many techniques and tools for…

Software Engineering · Computer Science 2023-11-21 Junjie Shan, Shihan Dou, Yueming Wu, Hairu Wu, Yang Liu

What do pre-trained code models know about code?

Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from…

Software Engineering · Computer Science 2021-08-26 Anjan Karmakar, Romain Robbes

INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers

Pre-trained models of source code have recently been successfully applied to a wide variety of Software Engineering tasks; they have also seen some practical adoption, e.g. for code completion. Yet, we still know very little…

Software Engineering · Computer Science 2023-12-11 Anjan Karmakar, Romain Robbes

Enhancing Code Vulnerability Detection via Vulnerability-Preserving Data Augmentation

Source code vulnerability detection aims to identify inherent vulnerabilities to safeguard software systems from potential attacks. Many prior studies overlook diverse vulnerability characteristics, simplifying the problem into a binary…

Cryptography and Security · Computer Science 2024-04-16 Shangqing Liu, Wei Ma, Jian Wang, Xiaofei Xie, Ruitao Feng, Yang Liu

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These…

Software Engineering · Computer Science 2022-02-15 Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, Hai Jin

Probing Semantic Grounding in Language Models of Code with Representational Similarity Analysis

Representational Similarity Analysis is a method from cognitive neuroscience, which helps in comparing representations from two different sources of data. In this paper, we propose using Representational Similarity Analysis to probe the…

Computation and Language · Computer Science 2022-07-19 Shounak Naik, Rajaswa Patil, Swati Agarwal, Veeky Baths

A Combined Feature Embedding Tools for Multi-Class Software Defect and Identification

In software, a vulnerability is a defect in a program that attackers might utilize to acquire unauthorized access, alter system functions, and acquire information. These vulnerabilities arise from programming faults, design flaws, incorrect…

Software Engineering · Computer Science 2024-11-28 Md. Fahim Sultan, Tasmin Karim, Md. Shazzad Hossain Shaon, Mohammad Wardat, Mst Shapna Akter

Enhancing Source Code Classification Effectiveness via Prompt Learning Incorporating Knowledge Features

Researchers have investigated the potential of leveraging pre-trained language models, such as CodeBERT, to enhance source code-related tasks. Previous methodologies have relied on CodeBERT's '[CLS]' token as the embedding representation of…

Computation and Language · Computer Science 2024-09-04 Yong Ma, Senlin Luo, Yu-Ming Shang, Yifei Zhang, Zhengjun Li

I Know Who Clones Your Code: Interpretable Smart Contract Similarity Detection

Widespread reuse of open-source code in smart contract development boosts programming efficiency but significantly amplifies bug propagation across contracts, while dedicated methods for detecting similar smart contract functions remain…

Software Engineering · Computer Science 2025-09-12 Zhenguang Liu, Lixun Ma, Zhongzheng Mu, Chengkun Wei, Xiaojun Xu, Yingying Jiao, Kui Ren

Learning code summarization from a small and local dataset

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of…

Software Engineering · Computer Science 2022-06-03 Toufique Ahmed, Premkumar Devanbu

Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

Understanding the functional (dis)-similarity of source code is significant for code modeling tasks such as software vulnerability and code clone detection. We present DISCO (DIS-similarity of COde), a novel self-supervised model focusing on…

Programming Languages · Computer Science 2022-03-22 Yangruibo Ding, Luca Buratti, Saurabh Pujar, Alessandro Morari, Baishakhi Ray, Saikat Chakraborty

Source Code Retrieval Using Sequence Based Similarity

Duplicated code has a negative impact on the quality of software systems and should be detected at least. In this paper, we discuss an approach that improves source code retrieval using the structural information about the programs. We…

Software Engineering · Computer Science 2013-08-19 Yoshihisa Udagawa

Evaluating Small-Scale Code Models for Code Clone Detection

Detecting code clones is relevant to software maintenance and code refactoring. This challenge still presents unresolved cases, mainly when structural similarity does not reflect functional equivalence, though recent code models show…

Software Engineering · Computer Science 2025-06-16 Jorge Martinez-Gil