Related papers: SynCoBERT: Syntax-Guided Multi-Modal Contrastive P…

CODE-MVP: Learning to Represent Source Code from Multiple Views with Contrastive Pre-Training

Recent years have witnessed increasing interest in code representation learning, which aims to represent the semantics of source code into distributed vectors. Currently, various works have been proposed to represent the complex semantics…

Programming Languages · Computer Science 2022-05-05 Xin Wang , Yasheng Wang , Yao Wan , Jiawei Wang , Pingyi Zhou , Li Li , Hao Wu , Jin Liu

Contrastive Code Representation Learning

Recent work learns contextual representations of source code by reconstructing tokens from their context. For downstream semantic understanding tasks like summarizing code in English, these representations should ideally capture program…

Machine Learning · Computer Science 2022-01-10 Paras Jain , Ajay Jain , Tianjun Zhang , Pieter Abbeel , Joseph E. Gonzalez , Ion Stoica

GraphCodeBERT: Pre-training Code Representations with Data Flow

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code…

Software Engineering · Computer Science 2021-09-14 Daya Guo , Shuo Ren , Shuai Lu , Zhangyin Feng , Duyu Tang , Shujie Liu , Long Zhou , Nan Duan , Alexey Svyatkovskiy , Shengyu Fu , Michele Tufano , Shao Kun Deng , Colin Clement , Dawn Drain , Neel Sundaresan , Jian Yin , Daxin Jiang , Ming Zhou

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These…

Software Engineering · Computer Science 2022-02-15 Yao Wan , Wei Zhao , Hongyu Zhang , Yulei Sui , Guandong Xu , Hai Jin

ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning

Large-scale pre-trained models such as CodeBERT, GraphCodeBERT have earned widespread attention from both academia and industry. Attributed to the superior ability in code representation, they have been further applied in multiple…

Software Engineering · Computer Science 2023-01-24 Shangqing Liu , Bozhi Wu , Xiaofei Xie , Guozhu Meng , Yang Liu

Enhancing Source Code Classification Effectiveness via Prompt Learning Incorporating Knowledge Features

Researchers have investigated the potential of leveraging pre-trained language models, such as CodeBERT, to enhance source code-related tasks. Previous methodologies have relied on CodeBERT's '[CLS]' token as the embedding representation of…

Computation and Language · Computer Science 2024-09-04 Yong Ma , Senlin Luo , Yu-Ming Shang , Yifei Zhang , Zhengjun Li

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and nat-ural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language codesearch, code…

Computation and Language · Computer Science 2020-09-21 Zhangyin Feng , Daya Guo , Duyu Tang , Nan Duan , Xiaocheng Feng , Ming Gong , Linjun Shou , Bing Qin , Ting Liu , Daxin Jiang , Ming Zhou

CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to…

Sound · Computer Science 2023-07-06 Chutong Meng , Junyi Ao , Tom Ko , Mingxuan Wang , Haizhou Li

Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations

We propose Corder, a self-supervised contrastive learning framework for source code model. Corder is designed to alleviate the need of labeled data for code retrieval and code summarization tasks. The pre-trained model of Corder can be used…

Software Engineering · Computer Science 2021-05-25 Nghi D. Q. Bui , Yijun Yu , Lingxiao Jiang

CCBERT: Self-Supervised Code Change Representation Learning

Numerous code changes are made by developers in their daily work, and a superior representation of code changes is desired for effective code change analysis. Recently, Hoang et al. proposed CC2Vec, a neural network-based approach that…

Software Engineering · Computer Science 2023-09-28 Xin Zhou , Bowen Xu , DongGyun Han , Zhou Yang , Junda He , David Lo

TreeBERT: A Tree-Based Pre-Trained Model for Programming Language

Source code can be parsed into the abstract syntax tree (AST) based on defined syntax rules. However, in pre-training, little work has considered the incorporation of tree structure into the learning process. In this paper, we present…

Machine Learning · Computer Science 2021-07-16 Xue Jiang , Zhuoran Zheng , Chen Lyu , Liang Li , Lei Lyu

UniXcoder: Unified Cross-Modal Pre-training for Code Representation

Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models.…

Computation and Language · Computer Science 2022-03-09 Daya Guo , Shuai Lu , Nan Duan , Yanlin Wang , Ming Zhou , Jian Yin

CONCORD: Clone-aware Contrastive Learning for Source Code

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE…

Software Engineering · Computer Science 2023-06-07 Yangruibo Ding , Saikat Chakraborty , Luca Buratti , Saurabh Pujar , Alessandro Morari , Gail Kaiser , Baishakhi Ray

Soft-Labeled Contrastive Pre-training for Function-level Code Representation

Code contrastive pre-training has recently achieved significant progress on code-related tasks. In this paper, we present \textbf{SCodeR}, a \textbf{S}oft-labeled contrastive pre-training framework with two positive sample construction…

Computation and Language · Computer Science 2022-10-27 Xiaonan Li , Daya Guo , Yeyun Gong , Yun Lin , Yelong Shen , Xipeng Qiu , Daxin Jiang , Weizhu Chen , Nan Duan

Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Programming language understanding and representation (a.k.a code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of…

Software Engineering · Computer Science 2023-12-04 Weisong Sun , Chunrong Fang , Yun Miao , Yudu You , Mengzhe Yuan , Yuchen Chen , Quanjun Zhang , An Guo , Xiang Chen , Yang Liu , Zhenyu Chen

Code Representation Learning with Pr\"ufer Sequences

An effective and efficient encoding of the source code of a computer program is critical to the success of sequence-to-sequence deep neural network models for tasks in computer program comprehension, such as automated code summarization and…

Artificial Intelligence · Computer Science 2021-11-16 Tenzin Jinpa , Yong Gao

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks. Existing solutions dominantly capitalize on the multi-modal inputs…

Computer Vision and Pattern Recognition · Computer Science 2021-12-15 Jianjie Luo , Yehao Li , Yingwei Pan , Ting Yao , Hongyang Chao , Tao Mei

Code Representation Learning At Scale

Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred…

Computation and Language · Computer Science 2024-02-06 Dejiao Zhang , Wasi Ahmad , Ming Tan , Hantian Ding , Ramesh Nallapati , Dan Roth , Xiaofei Ma , Bing Xiang

SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations

Recent years have seen the successful application of large pre-trained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application…

Software Engineering · Computer Science 2022-05-26 Changan Niu , Chuanyi Li , Vincent Ng , Jidong Ge , Liguo Huang , Bin Luo

AstBERT: Enabling Language Model for Financial Code Understanding with Abstract Syntax Trees

Using the pre-trained language models to understand source codes has attracted increasing attention from financial institutions owing to the great potential to uncover financial risks. However, there are several challenges in applying these…

Artificial Intelligence · Computer Science 2022-10-12 Rong Liang , Tiehua Zhang , Yujie Lu , Yuze Liu , Zhen Huang , Xin Chen