Related papers: Towards Learning (Dis)-Similarity of Source Code f…

CONCORD: Clone-aware Contrastive Learning for Source Code

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE…

Software Engineering · Computer Science 2023-06-07 Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, Baishakhi Ray

Source Code Clone Detection Using Unsupervised Similarity Measures

Assessing similarity in source code has gained significant attention in recent years due to its importance in software engineering tasks such as clone detection and code search and recommendation. This work presents a comparative analysis…

Software Engineering · Computer Science 2024-08-13 Jorge Martinez-Gil

Evaluation of Contrastive Learning with Various Code Representations for Code Clone Detection

Code clones are pairs of code snippets that implement similar functionality. Clone detection is a fundamental branch of automatic source code comprehension, having many applications in refactoring recommendation, plagiarism detection, and…

Software Engineering · Computer Science 2022-06-20 Maksim Zubkov, Egor Spirin, Egor Bogomolov, Timofey Bryksin

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

We propose Corder, a self-supervised contrastive learning framework for source code model. Corder is designed to alleviate the need of labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used…

Software Engineering · Computer Science 2021-05-25 Nghi D. Q. Bui, Yijun Yu, Lingxiao Jiang

Scalable Program Clone Search Through Spectral Analysis

We consider the problem of program clone search, i.e. given a target program and a repository of known programs (all in executable format), the goal is to find the program in the repository most similar to the target program - with…

Cryptography and Security · Computer Science 2023-09-04 Tristan Benoit, Jean-Yves Marion, Sébastien Bardin

Modeling Functional Similarity in Source Code with Graph-Based Siamese Networks

Code clones are duplicate code fragments that share (nearly) similar syntax or semantics. Code clone detection plays an important role in software maintenance, code refactoring, and reuse. A substantial amount of research has been conducted…

Software Engineering · Computer Science 2020-11-26 Nikita Mehrotra, Navdha Agarwal, Piyush Gupta, Saket Anand, David Lo, Rahul Purandare

Cross-Language Source Code Clone Detection Using Deep Learning with InferCode

Software clones are beneficial to detect security gaps and software maintenance in one programming language or across multiple languages. The existing work on source clone detection performs well but in a single programming language.…

Software Engineering · Computer Science 2022-05-11 Mohammad A. Yahya, Dae-Kyoo Kim

Detecting Code Clones: A review

Code clone detection is involved with detecting duplicated fragments of code within a code base. Detecting these clones is useful for maintenance operations which require editing the clones. The tools developed are expected to be robust…

Software Engineering · Computer Science 2016-05-10 Ogechi Onuoha

Addressing Leakage in Self-Supervised Contextualized Code Retrieval

We address contextualized code retrieval, the search for code snippets helpful to fill gaps in a partial input program. Our approach facilitates a large-scale self-supervised contrastive training by splitting source code randomly into…

Software Engineering · Computer Science 2022-04-26 Johannes Villmow, Viola Campos, Adrian Ulges, Ulrich Schwanecke

DISCO: A Browser-Based Privacy-Preserving Framework for Distributed Collaborative Learning

Data is often impractical to share for a range of well considered reasons, such as concerns over privacy, intellectual property, and legal constraints. This not only fragments the statistical power of predictive models, but creates an…

Machine Learning · Computer Science 2025-11-26 Julien T. T. Vignoud, Valérian Rousset, Hugo El Guedj, Ignacio Aleman, Walid Bennaceur, Batuhan Faik Derinbay, Eduard Ďurech, Damien Gengler, Lucas Giordano, Felix Grimberg, Franziska Lippoldt, Christina Kopidaki, Jiafan Liu, Lauris Lopata, Nathan Maire, Paul Mansat, Martin Milenkoski, Emmanuel Omont, Güneş Özgün, Mina Petrović, Francesco Posa, Morgan Ridel, Giorgio Savini, Marcel Torne, Lucas Trognon, Alyssa Unell, Olena Zavertiaieva, Sai Praneeth Karimireddy, Tahseen Rabbani, Mary-Anne Hartley, Martin Jaggi

Use of Source Code Similarity Metrics in Software Defect Prediction

In recent years, defect prediction has received a great deal of attention in the empirical software engineering world. Predicting software defects before the maintenance phase is very important not only to decrease the maintenance costs but…

Software Engineering · Computer Science 2018-08-31 Ahmet Okutan

CodeS: Towards Code Model Generalization Under Distribution Shift

Distribution shift has been a longstanding challenge for the reliable deployment of deep learning (DL) models due to unexpected accuracy degradation. Although DL has been becoming a driving force for large-scale source code analysis in the…

Software Engineering · Computer Science 2023-02-07 Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, Yves Le Traon

Detecting Semantic Clones of Unseen Functionality

Semantic code clone detection is the task of detecting whether two snippets of code implement the same functionality (e.g., Sort Array). Recently, many neural models achieved near-perfect performance on this task. These models seek to make…

Software Engineering · Computer Science 2025-12-02 Konstantinos Kitsios, Francesco Sovrano, Earl T. Barr, Alberto Bacchelli

Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities

Discerning between authentic content and that generated by advanced AI methods has become increasingly challenging. While previous research primarily addresses the detection of fake faces, the identification of generated natural images has…

Computer Vision and Pattern Recognition · Computer Science 2024-07-31 Lorenzo Baraldi, Federico Cocchi, Marcella Cornia, Lorenzo Baraldi, Alessandro Nicolosi, Rita Cucchiara

Contrastive Code Representation Learning

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program…

Machine Learning · Computer Science 2022-01-10 Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica

DeepClone: Modeling Clones to Generate Code Predictions

Programmers often reuse code from source code repositories to reduce the development effort. Code clones are candidates for reuse in exploratory or rapid development, as they represent often repeated functionality in software systems. To…

Software Engineering · Computer Science 2020-12-08 Muhammad Hammad, Önder Babur, Hamid Abdul Basit, Mark van den Brand

Source Code Retrieval Using Sequence Based Similarity

Duplicated code has a negative impact on the quality of software systems and should be detected at least. In this paper, we discuss an approach that improves source code retrieval using the structural information about the programs. We…

Software Engineering · Computer Science 2013-08-19 Yoshihisa Udagawa

Scalable Source Code Similarity Detection in Large Code Repositories

Source code similarity is increasingly used in application development to identify clones, isolate bugs, and find copyright violations. Similar code fragments can be very problematic due to the fact that errors in the original code must…

Software Engineering · Computer Science 2019-07-30 F Alomari, M Harbi

SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings

Function-level binary code similarity detection is a crucial aspect of cybersecurity. It enables the detection of bugs and patent infringements in released software and plays a pivotal role in preventing supply chain attacks. A practical…

Cryptography and Security · Computer Science 2023-12-27 Sun RuiJin, Guo Shize, Guo Jinhong, Li Wei, Zhan Dazhi, Sun Meng, Pan Zhisong

Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features

This paper investigates source code similarity detection using a transformer model augmented with an execution-derived signal. We extend GraphCodeBERT with an explicit, low-dimensional behavioral feature that captures observable agreement…

Software Engineering · Computer Science 2026-02-11 Jorge Martinez-Gil