Related papers: Pre-Training Representations of Binary Code Using …

Contrastive Code Representation Learning

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program…

Machine Learning · Computer Science 2022-01-10 Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph E. Gonzalez, Ion Stoica

CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision

Binary code representation learning has shown significant performance in binary analysis tasks. But existing solutions often have poor transferability, particularly in few-shot and zero-shot scenarios where few or no training samples are…

Software Engineering · Computer Science 2024-02-28 Hao Wang, Zeyu Gao, Chao Zhang, Zihan Sha, Mingyang Sun, Yuchen Zhou, Wenyu Zhu, Wenju Sun, Han Qiu, Xi Xiao

Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code in different contexts. Modern methods have progressed from manually…

Artificial Intelligence · Computer Science 2025-09-30 Charles E. Gagnon, Steven H. H. Ding, Philippe Charland, Benjamin C. M. Fung

GraphBinMatch: Graph-based Similarity Learning for Cross-Language Binary and Source Code Matching

Matching binary to source code and vice versa has various applications in different fields, such as computer security, software engineering, and reverse engineering. Even though there exist methods that try to match source code with binary…

Software Engineering · Computer Science 2023-04-11 Ali TehraniJamsaz, Hanze Chen, Ali Jannesari

Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries

Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However,…

Cryptography and Security · Computer Science 2023-01-16 Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Premkumar Devanbu, Arie van Deursen

An Empirical Study on the Effectiveness of Large Language Models for Binary Code Understanding

Binary code analysis plays a pivotal role in the field of software security and is widely used in tasks such as software maintenance, malware detection, software vulnerability discovery, patch analysis, etc. However, unlike source code,…

Software Engineering · Computer Science 2025-05-01 Xiuwei Shang, Zhenkan Fu, Shaoyin Cheng, Guoqiang Chen, Gangyang Li, Li Hu, Weiming Zhang, Nenghai Yu

Leveraging Artificial Intelligence on Binary Code Comprehension

Understanding binary code is an essential but complex software engineering task for reverse engineering, malware analysis, and compiler optimization. Unlike source code, binary code has limited semantic information, which makes it…

Software Engineering · Computer Science 2022-10-12 Yifan Zhang

CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection

Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently; most are embedding-based but have shown limited accuracy and efficiency, especially when…

Software Engineering · Computer Science 2024-03-01 Hao Wang, Zeyu Gao, Chao Zhang, Mingyang Sun, Yuchen Zhou, Han Qiu, Xi Xiao

Can Contrastive Learning Refine Embeddings

Recent advancements in contrastive learning have revolutionized self-supervised representation learning and achieved state-of-the-art performance on benchmark tasks. While most existing methods focus on applying contrastive learning to…

Machine Learning · Computer Science 2024-04-16 Lihui Liu, Jinha Kim, Vidit Bansal

BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding

Function names can greatly aid human reverse engineers, which has spurred the development of machine learning-based approaches to predicting function names in stripped binaries. Much current work in this area now uses transformers, applying…

Machine Learning · Computer Science 2025-02-04 Tristan Benoit, Yunru Wang, Moritz Dannehl, Johannes Kinder

Hard Negative Mixing for Contrastive Learning

Contrastive learning has become a key component of self-supervised learning approaches for computer vision. By learning to embed two augmented versions of the same image close to each other and to push the embeddings of different images…

Computer Vision and Pattern Recognition · Computer Science 2020-12-07 Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, Diane Larlus

Binary Code Similarity Detection (BCSD) plays a crucial role in numerous fields, including vulnerability detection, malware analysis, and code reuse identification. As IoT devices proliferate and rapidly evolve, their highly heterogeneous…

Software Engineering · Computer Science 2024-10-25 Xiuwei Shang, Li Hu, Shaoyin Cheng, Guoqiang Chen, Benlong Wu, Weiming Zhang, Nenghai Yu

Contrasting Contrastive Self-Supervised Representation Learning Pipelines

In the past few years, we have witnessed remarkable breakthroughs in self-supervised representation learning. Despite the success and adoption of representations learned through this paradigm, much is yet to be understood about how…

Computer Vision and Pattern Recognition · Computer Science 2021-08-20 Klemen Kotar, Gabriel Ilharco, Ludwig Schmidt, Kiana Ehsani, Roozbeh Mottaghi

Improving the Learning of Code Review Successive Tasks with Cross-Task Knowledge Distillation

Code review is a fundamental process in software development that plays a pivotal role in ensuring code quality and reducing the likelihood of errors and bugs. However, code review can be complex, subjective, and time-consuming. Quality…

Software Engineering · Computer Science 2024-02-06 Oussama Ben Sghaier, Houari Sahraoui

A Cross-Architecture Instruction Embedding Model for Natural Language Processing-Inspired Binary Code Analysis

Given a closed-source program, such as most proprietary software and viruses, binary code analysis is indispensable for many tasks, such as code plagiarism detection and malware analysis. Today, source code is very often compiled for…

Cryptography and Security · Computer Science 2018-12-27 Kimberly Redmond, Lannan Luo, Qiang Zeng

How Far Have We Gone in Binary Code Understanding Using Large Language Models

Binary code analysis plays a pivotal role in various software security applications, such as software maintenance, malware detection, software vulnerability discovery, patch analysis, etc. However, unlike source code, understanding binary…

Software Engineering · Computer Science 2024-10-25 Xiuwei Shang, Shaoyin Cheng, Guoqiang Chen, Yanming Zhang, Li Hu, Xiao Yu, Gangyang Li, Weiming Zhang, Nenghai Yu

Evaluating Disassembly Errors With Only Binaries

Disassemblers are crucial in the analysis and modification of binaries. Existing works showing disassembler errors largely rely on practical implementation without specific guarantees and assume source code and compiler toolchains to…

Cryptography and Security · Computer Science 2025-07-08 Lambang Akbar Wijayadi, Yuancheng Jiang, Roland H. C. Yap, Zhenkai Liang, Zhuohao Liu

Contrastive Instruction Tuning

Instruction tuning has been used as a promising approach to improve the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs…

Computation and Language · Computer Science 2024-06-07 Tianyi Lorena Yan, Fei Wang, James Y. Huang, Wenxuan Zhou, Fan Yin, Aram Galstyan, Wenpeng Yin, Muhao Chen

On the Role of Pre-trained Embeddings in Binary Code Analysis

Deep learning has enabled remarkable progress in binary code analysis. In particular, pre-trained embeddings of assembly code have become a gold standard for solving analysis tasks, such as measuring code similarity or recognizing…

Machine Learning · Computer Science 2025-02-14 Alwin Maier, Felix Weissberg, Konrad Rieck

CHAIN: Exploring Global-Local Spatio-Temporal Information for Improved Self-Supervised Video Hashing

Compressing videos into binary codes can improve retrieval speed and reduce storage overhead. However, learning accurate hash codes for video retrieval can be challenging due to high local redundancy and complex global dependencies between…

Computer Vision and Pattern Recognition · Computer Science 2023-11-06 Rukai Wei, Yu Liu, Jingkuan Song, Heng Cui, Yanzhao Xie, Ke Zhou