Related papers: CodeBERT-nt: code naturalness via CodeBERT

Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…

Software Engineering · Computer Science 2020-08-19 Aditya Kanade , Petros Maniatis , Gogul Balakrishnan , Kensen Shi

CodeBERT: A Pre-Trained Model for Programming and Natural Languages

We present CodeBERT, a bimodal pre-trained model for programming language (PL) and nat-ural language (NL). CodeBERT learns general-purpose representations that support downstream NL-PL applications such as natural language codesearch, code…

Computation and Language · Computer Science 2020-09-21 Zhangyin Feng , Daya Guo , Duyu Tang , Nan Duan , Xiaocheng Feng , Ming Gong , Linjun Shou , Bing Qin , Ting Liu , Daxin Jiang , Ming Zhou

CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Since the rise of neural natural-language-to-code models (NL->Code) that can generate long expressions and statements rather than a single next-token, one of the major problems has been reliably evaluating their generated output. In this…

Software Engineering · Computer Science 2023-11-01 Shuyan Zhou , Uri Alon , Sumit Agarwal , Graham Neubig

INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers

Pre-trained models of source code have recently been successfully applied to a wide variety of Software Engineering tasks; they have also seen some practical adoption in practice, e.g. for code completion. Yet, we still know very little…

Software Engineering · Computer Science 2023-12-11 Anjan Karmakar , Romain Robbes

ContraBERT: Enhancing Code Pre-trained Models via Contrastive Learning

Large-scale pre-trained models such as CodeBERT, GraphCodeBERT have earned widespread attention from both academia and industry. Attributed to the superior ability in code representation, they have been further applied in multiple…

Software Engineering · Computer Science 2023-01-24 Shangqing Liu , Bozhi Wu , Xiaofei Xie , Guozhu Meng , Yang Liu

Enhancing Source Code Classification Effectiveness via Prompt Learning Incorporating Knowledge Features

Researchers have investigated the potential of leveraging pre-trained language models, such as CodeBERT, to enhance source code-related tasks. Previous methodologies have relied on CodeBERT's '[CLS]' token as the embedding representation of…

Computation and Language · Computer Science 2024-09-04 Yong Ma , Senlin Luo , Yu-Ming Shang , Yifei Zhang , Zhengjun Li

Learning code summarization from a small and local dataset

Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many software engineering tasks. These models are pre-trained (using self-supervision) with billions of code tokens, and then fine-tuned with hundreds of thousands of…

Software Engineering · Computer Science 2022-06-03 Toufique Ahmed , Premkumar Devanbu

JavaBERT: Training a transformer-based model for the Java programming language

Code quality is and will be a crucial factor while developing new software code, requiring appropriate tools to ensure functional and reliable code. Machine learning techniques are still rarely used for software engineering tools, missing…

Software Engineering · Computer Science 2021-10-22 Nelson Tavares de Sousa , Wilhelm Hasselbring

Diet Code Is Healthy: Simplifying Programs for Pre-trained Models of Code

Pre-trained code representation models such as CodeBERT have demonstrated superior performance in a variety of software engineering tasks, yet they are often heavy in complexity, quadratically with the length of the input sequence. Our…

Software Engineering · Computer Science 2022-11-22 Zhaowei Zhang , Hongyu Zhang , Beijun Shen , Xiaodong Gu

What do pre-trained code models know about code?

Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation, code summarization, among others. However, whether the vector representations from…

Software Engineering · Computer Science 2021-08-26 Anjan Karmakar , Romain Robbes

GraphCodeBERT: Pre-training Code Representations with Data Flow

Pre-trained models for programming language have achieved dramatic empirical improvements on a variety of code-related tasks such as code search, code completion, code summarization, etc. However, existing pre-trained models regard a code…

Software Engineering · Computer Science 2021-09-14 Daya Guo , Shuo Ren , Shuai Lu , Zhangyin Feng , Duyu Tang , Shujie Liu , Long Zhou , Nan Duan , Alexey Svyatkovskiy , Shengyu Fu , Michele Tufano , Shao Kun Deng , Colin Clement , Dawn Drain , Neel Sundaresan , Jian Yin , Daxin Jiang , Ming Zhou

NatGen: Generative pre-training by "Naturalizing" source code

Pre-trained Generative Language models (e.g. PLBART, CodeT5, SPT-Code) for source code yielded strong results on several tasks in the past few years, including code generation and translation. These models have adopted varying pre-training…

Programming Languages · Computer Science 2022-07-07 Saikat Chakraborty , Toufique Ahmed , Yangruibo Ding , Premkumar Devanbu , Baishakhi Ray

Code Vulnerability Detection Across Different Programming Languages with AI Models

Security vulnerabilities present in a code that has been written in diverse programming languages are among the most critical yet complicated aspects of source code to detect. Static analysis tools based on rule-based patterns usually do…

Cryptography and Security · Computer Science 2025-08-19 Hael Abdulhakim Ali Humran , Ferdi Sonmez

BERT2Code: Can Pretrained Language Models be Leveraged for Code Search?

Millions of repetitive code snippets are submitted to code repositories every day. To search from these large codebases using simple natural language queries would allow programmers to ideate, prototype, and develop easier and faster.…

Software Engineering · Computer Science 2021-04-19 Abdullah Al Ishtiaq , Masum Hasan , Md. Mahim Anjum Haque , Kazi Sajeed Mehrab , Tanveer Muttaqueen , Tahmid Hasan , Anindya Iqbal , Rifat Shahriyar

Probing Pretrained Models of Source Code

Deep learning models are widely used for solving challenging code processing tasks, such as code generation or code summarization. Traditionally, a specific model architecture was carefully built to solve a particular code processing task.…

Software Engineering · Computer Science 2022-11-18 Sergey Troshin , Nadezhda Chirkova

Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond

Recently, fine-tuning pre-trained code models such as CodeBERT on downstream tasks has achieved great success in many software testing and analysis tasks. While effective and prevalent, fine-tuning the pre-trained parameters incurs a large…

Software Engineering · Computer Science 2023-04-12 Ensheng Shi , Yanlin Wang , Hongyu Zhang , Lun Du , Shi Han , Dongmei Zhang , Hongbin Sun

On the Effectiveness of Transfer Learning for Code Search

The Transformer architecture and transfer learning have marked a quantum leap in natural language processing, improving the state of the art across a range of text-based tasks. This paper examines how these advancements can be applied to…

Software Engineering · Computer Science 2022-08-29 Pasquale Salza , Christoph Schwizer , Jian Gu , Harald C. Gall

Method-Level Bug Severity Prediction using Source Code Metrics and LLMs

In the past couple of decades, significant research efforts are devoted to the prediction of software bugs. However, most existing work in this domain treats all bugs the same, which is not the case in practice. It is important for a defect…

Software Engineering · Computer Science 2023-09-07 Ehsan Mashhadi , Hossein Ahmadvand , Hadi Hemmati

What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code

Recently, many pre-trained language models for source code have been proposed to model the context of code and serve as a basis for downstream code intelligence tasks such as code completion, code search, and code summarization. These…

Software Engineering · Computer Science 2022-02-15 Yao Wan , Wei Zhao , Hongyu Zhang , Yulei Sui , Guandong Xu , Hai Jin

Bringing Structure to Naturalness: On the Naturalness of ASTs

Source code comes in different shapes and forms. Previous research has already shown code to be more predictable than natural language as well as highlighted its statistical predictability at the token level: source code can be natural.…

Software Engineering · Computer Science 2025-04-14 Profir-Petru Pârţachi , Mahito Sugiyama